Welcome to Eva Winterschön's site for memories and reflections
Eva Winterschön

reflections on OSS, HPC, and AI/ML engineering, with occasional considerations of Cognitive Neuroscience

    System Migration Complete, and Now it's Cherry Time

    Sunday offers some EOD and EOW wrap-up thoughts, with a bit of meandering grammatical expressionism. The present state of this week’s hardware resource aggregation into one of the LLM research boxes is… 98%. The data migration from earlier today is complete, with ZFS offering some enjoyable numbers.

    zpool scrub cranking out 19.6G/s, with 2.74T scanned at approximately the two-minute mark and the pool predicting a total of 38 minutes to complete
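    For anyone who wants to watch the same numbers roll by, this is just the stock scrub workflow; a minimal sketch, with the pool name “tank” standing in for whatever your pool is actually called:

        # kick off a scrub on the pool ("tank" is a placeholder name)
        zpool scrub tank

        # check progress; the "scan:" line reports throughput, data scanned, and the ETA
        zpool status tank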

    While building the spec and futzing with the BOM, one of my priorities was to ensure zero PCIe lane contention, with I/O going direct to 64 cores at 3.5GHz peaks, and eight memory channels in use (4x ECC-REG + 4x NVDIMM). All of those conditions have been benchmarking well so far; yay for meeting personally set expectations. 🤗
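    If you’d rather verify that channel population from a booted OS than trust the BIOS screens, dmidecode is usually enough; a rough sketch on Linux (run as root), with nothing in it specific to my board:

        # list DIMM slots with their size, type, and configured speed;
        # empty channels show up as "No Module Installed"
        sudo dmidecode -t memory | grep -E 'Locator|Size|Type:|Speed'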

    I’d say this is mostly pretty decent for a draid1 array with NVMe on a single-socket Xeon ICX box, and so far much better than the dual-socket Ice Lake boxes that I was benching at ${OldOrg}. The recurring two-socket issue on this Xeon generation, and on its successor Sapphire Rapids, is that there’s too much cross-socket traffic unless specific choices are made about PCIe peripheral placement, BIOS tuning, kernel settings, and core-pinning; all of which is time consuming at a minimum, and head-scratching “Just call Premier Support!” territory at the far end, unless you know better.

    This should not come as a surprise to anyone who has been benchmarking multi-socket Xeons since Xeons were brought into existence. Having had that experience, and because “data + facts + personal experience == real answers”, it was simple to predict the performance concerns around cross-socket latency, not to mention PCIe slot-to-socket sequencing and NUMA domain crossing, especially when optimizing for high-throughput network-based storage transfers. If you scaled from one socket to two to four on the old Nehalem-EX based Xeons, then you know what I mean - they could never match the classic Big Iron vendors' quad-socket designs like Sun’s SPARC T3 and M3 series, and certainly weren’t competitive with IBM’s POWER7 options.
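    For the curious, none of this needs vendor tooling on Linux; checking device-to-node locality and pinning the hot path is a handful of commands. A rough sketch, with the device names, node number, and fio target purely illustrative:

        # map out sockets, cores, and NUMA nodes
        numactl --hardware

        # which NUMA node owns a given NIC or NVMe controller (names are placeholders)
        cat /sys/class/net/enp65s0f0/device/numa_node
        cat /sys/class/nvme/nvme0/device/numa_node

        # pin a storage benchmark to the node that owns the device, so the
        # traffic never has to cross the socket interconnect
        numactl --cpunodebind=0 --membind=0 fio --name=seqread \
            --filename=/tank/testfile --rw=read --bs=1M --iodepth=32 \
            --ioengine=libaio --size=8G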

    Anyway, there’s a little game to play in storage server land called “high-density rack-scale storage is more efficient with single-socket boards using high-lane-count, high-clock, big-cache SKUs”. This isn’t the time for a lesson in the predictive algorithms needed for resource optimization, the ones you’d use in a cost/benefit analysis of horizontal vs vertical (or combined) storage system scaling at the rack level; it is, however, time for me to get off the computer, eat some cherries, and watch one of the alternate endings for Cyberpunk 2077 Phantom Liberty!

    And really, for a Sunday evening, remaining at this standing desk in the office-lab is not appealing, particularly due to the “it’s just too damn hot for pants!” environment, even with the A/C on full blast (a common metric for this lab, though it’s fine weather for a skirt).

    PS, the pool scrub finished with zero errors. I am pleased. Migration complete.

    #linux #zfs #xeon #intel #ibm #sunmicrosystems #storage #hardware #engineering #memories

    Need more PCIe slots, not more lanes, but slots.

    Staring at this board, mentally converting PCIe lane count to device requirements, only to consider that yet again…

    It’s 2025. We’re still using the PCIe slot standard for most motherboards, even on enterprise boards. Who’s doing this right, at least marginally so?

    ASRack (brain, please grow up and stop making elementary jokes) has been putting four physical x16 slots on their “Deep mini-ITX” series of server boards, plus two x8 SlimSAS and two x4 OCuLink headers. These can be used for a variety of SFF fan-out connection purposes.

    Extensions like this include:

    • multi-drive backplane headers for hot-swap 2.5", 3.5", U.2, and U.3 bays
    • additional M.2 slots via an SFF-backed PCB
    • conversions from one controller standard to another, e.g. PCIe to SATA3 or SAS3
    • external ports for SAN and DAS expansion
    • additional physical x4, x8, or x16 slots - mount that style of connector in an adjacent or remote chassis zone via a 1:1 PCIe extension or a PCIe switch

    There are likely other examples, and I’m always interested in finding more.
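    Related tip: once the fan-out adapters and risers are in place, it’s worth confirming that every device actually negotiated the width and generation you expect; a quick Linux check, with the bus address below made up:

        # list devices and their bus addresses
        lspci

        # compare the advertised capability against the trained link
        # (LnkCap = what the device can do, LnkSta = what it actually negotiated)
        sudo lspci -vv -s 41:00.0 | grep -E 'LnkCap:|LnkSta:'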

    #servers #baremetal #linux #freebsd #motherboard #engineering #hardware

    The demise of Fedora as a Developer Distro

    The Q4-2024 through Q2-2025 cycle has been one of remarkable instability for Fedora kernels and systemd breaking-changes. F40 through F42 have offered more crashes, more kernel panics, and more core dumps than any three-version run since the distro’s dawning into existence - and I’ve been running it on dev-only boxes since the first release.

    In the worst of circumstances, Fedora notified me of a firmware update, a hands-off “reboot to install” EFI blob which bricked my ThinkPad X1 Nano laptop just last week. That’s an expensive failure and one with no simple resolution (many hours, many iso-to-img-to-efi/cap conversions, and maybe soon a WSOP-8 NAND reprogramming).

    I’ve been all-in on the Red Hat infra train since 1999, having used it at global backbone providers, satellite infra providers, HPC clusters, media conglomerates, gaming networks, streaming services, cloud providers and hyperscalers, and just about everything in between… and while I have my qualms with certain technical aspects and corporate strategy over those years, I still consider it one of my favorite Linux distros (top five at least), and it’s my first choice for enterprise Linux. So, I’m not just hating on the ecosystem here; I’m a vested user who’s been disappointed too many times by the direction.

    Unfortunately, with Fedora, I stand behind my position that Red Hat has been too hands-off in enforcing standards alignment and in requiring a base level of stability upon which serious engineering development can be done without having to clean up other people’s messes every few weeks.

    It’s just not a great dev platform distro anymore, and yet the org markets the name as if it were a valid OS for all manner of production uses. IoT use? Never, no thanks. Edge devices? Never again. Security? Forget about it.

    It’s not a “bleeding-edge issue” - that’s Rawhide’s territory, where stability of any sort is not expected. The evolving issue with Fedora is that it is uncommonly unstable for a “dev env”, shipping far too much untested code and pushing the equivalent of “crowd-sourced fail-tests” where users are expected to “just tell us if it doesn’t work”. Their release engineering and load-testing seem to occur with zero multivariate test matrices covering performance, validation, and functional testing. I suppose they consider unit testing to be “full coverage”, but it is not. Happy to be corrected here, though I will remain disappointed and avoid any further deployments or firmware updates from Fedora.

    #linux #fedora #releng #engineering #software #enterprise #rhel #distros #displeasure

    ...

    🌐 Weekend Hardware Update 🌐

    Some additional reorganization time this weekend, all part of preparing the HomeLab for another interstate relocation. I’ve decided to repurpose some hardware into a new router; this box hosted infra VMs several years ago, but now it’s going to run a pair of “BSD Router Project” VMs for on-host H/A, plus a bit of packet shuffling between hosts for RDMA over RoCE v2, iSCSI, and a few shared NFS mounts for the compute hosts, along with the usual firewall and packet filtering, some Wireguard, OpenVPN, and ZeroTier, and NTP + PTP via GPS/GNSS.
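    I haven’t settled on the failover mechanism yet, but CARP is the usual answer for a pair of BSD routers, so here’s the minimal sketch I’ll likely start from; the interface name, VHID, password, and shared address are all placeholders:

        # on the primary VM: load CARP and claim the shared virtual IP
        kldload carp
        ifconfig vtnet1 vhid 10 advskew 0 pass changeme alias 10.10.0.1/32

        # on the standby VM: same VHID, higher advskew so it only takes over on failure
        kldload carp
        ifconfig vtnet1 vhid 10 advskew 100 pass changeme alias 10.10.0.1/32

        # confirm MASTER/BACKUP state on either VM
        ifconfig vtnet1 | grep carp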

    Here’s a quick rundown on the specs, aggregated from the parts shelves and “it’s not actively being used elsewhere” collection of enterprise hardware.

    #homelab #freebsd #router #networking #engineering #hardware #servers

    ...

    🗑️ Import Your Failures! 🗑️

    While using Soundcloud the other day, I was prompted by a little helper fellow which stated:

    “Import your Spotify Playlists!”

    ok, sure. I’d like to use a single app for music if possible, and I prefer Soundcloud to Spotify in general. So, I click ok and authorize the apps' APIs to do their thing.

    It finishes importing. Great, back to work. Life goes on. Days later, that being five minutes ago, I’m sitting at my desk… “time for some metal”, and I decide to play a near-daily favorite: a “Heaven Shall Burn” compilation playlist which consists only of their music.

    Here’s what Soundcloud decided to import and use for the Heaven Shall Burn playlist (attached image). Look at the artist names, the album covers, and the absence of consecutive tracks that should all come from the same handful of albums. These aren’t even the same category of music, let alone the same band or albums.

    Soundcloud, you have failed. This was an easy one; you have the API calls with the correct track and artist IDs right there in the Spotify playlists to do the import on a 1:1 basis, but no, instead you’ve taken a giant pile of garbage and put it in my Library.

    Instead of “Protector” by Heaven Shall Burn, you give me “Protector” by Beyoncé. Trust me, they are not equivalent. This is garbage, more enshittification of the internet by unchecked, unvalidated coding styles which prioritize “we ship code always” over “we ship when code is ready”.

    Now I’m left with extra work to do, to clean up their mess of shitty playlists, and the tool imported A LOT of playlists.

    #music #streaming #steamingPileofSoftware #software #engineering #antipattern #trash #code #developers #enshittification #spotify #soundcloud #api #fail

    ...

    💻 Mozilla Thunderbird - The Failure of a Once Great Client 💻

    Thunderbird, wtf are you doing where you need 83% CPU plus ~50GB (virt) and 20G (res) of RAM allocated?

    The entire mailbox isn’t even 1GB in total size. Yet here it is, cranking out all of this nebulous processing 24/7… literally 24/7, it’s been running loads like this for months now. Generally I ignore it because the workstation has a modern EPYC with 32 cores and 128GB of RAM, so it’s not completely crippling my usage… but sometimes it does cripple things, and sometimes I want that 50/20GB split to not be around, and I want those clock cycles and L1/L2 cache doing something useful.
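    For the record, those numbers come straight from the process table, nothing exotic; roughly this on the Linux workstation (procps syntax):

        # per-process CPU, virtual size, and resident set (both in KiB) for Thunderbird
        ps -C thunderbird -o pid,pcpu,vsz,rss,comm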

    This issue has persisted across several major versions; it’s not new. I’ve gone through all of the usual steps, removing all themes and plugins, etc. Nothing fixes the problem, and there’s nothing useful in its logs. Uninstall and reinstall, use it on different computers, with different server settings, on different domain accounts - it doesn’t matter… Thunderbird always runs itself into failure mode eventually.

    I don’t want to dtrace it, I don’t want to strace it, I don’t want to deal with this at all because email is not complicated enough to warrant such investment of my engineering time or focus.

    I hate what Thunderbird has become, and I’ve been using it since version 1.0 way back in the 2000s. I’ve used it on all of the operating systems upon which it’s been supported. This has become absurd, the slow death of a once great product.

    You’re an email client. You handle IMAP. This resource usage is unacceptable.

    #opensource #mozilla #thunderbird #email #slow #debugging #programming #linux #freebsd #desktop

    ...

    💾 Updates on Remote Access to POWER9 💾

    Some quick progress notes for a few FreeBSD and OpenZFS devs who will be accessing one of my POWER9 systems.

    See photos for visual reference.

    • Separate L1 and L2 domains for all network access, a zero-shared DMZ, unconnected to the rest of the RFC99 lab rack
    • Mikrotik router has been temporarily slapped into place on the side of my lab rack, directly in back of the Talos II system (dual socket, 144 threads of fun!), using PoE for the main link
    • PiKVM v3 is being cabled up for a jump box, will connect to the Mikrotik switch ports
    • Talos II’s OpenBMC will be connected to the Mikrotik switch ports, offering iKVM and an ipmitool SoL terminal, both accessible via the jumpbox (command sketch after this list)
    • Wireguard tunnels will be allocated on the Mikrotik router for remote users
    • APC PDU for remote power cycling (hard reboot style) of the Talos II is accessible via SNMP v3 authenticated command
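
    For the SoL piece specifically, the sketch I’ll hand to the devs is roughly the following, run from the jumpbox; the BMC address, username, and password are placeholders:

        # open a serial-over-LAN console to the Talos II's OpenBMC
        ipmitool -I lanplus -H 10.20.0.5 -U admin -P 'changeme' sol activate

        # tear the session down cleanly if it gets stuck
        ipmitool -I lanplus -H 10.20.0.5 -U admin -P 'changeme' sol deactivate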

    Later today, at the “Bhyve Production Users” meeting, I’ll go over these notes.

    @dexter@bsd.network

    #freebsd #openzfs #power9 #engineering #opensource #networking #linux

    ⚒️ Linux Kollektivs - Oh Please ⚒️

    I love a lot about the Linux world, and there are a lot of positive aspects to the kernel itself, which has brought me a lot of joy over the years with kernel module development, debugging, tuning, analyzing… specifically for my career, but also on a personal level. For a quarter of a century I’ve been using Linux and the BSDs in many different flavors, but there’s an accelerating trend in the corporatized Linux space which has been quite concerning.

    The marketing trends used by The Big Names over the past decade remind me far too much of Cold War propaganda, of the lies from literal communists which stained my and many others’ youthful years. Those lies fell apart in 1989 and we all hoped that would be the end of it… but of course not.

    The Perception

    Das Kollektiv der Gemeinschaft! Wir sind für Sie da! Schließt euch den Reihen des Proletariats an und vernichtet den Menschen!

    Translated, loosely

    The collective of the community! We are here for you! Join the ranks of the proletariat and destroy the Man!

    I do not want your communism. I do not want your pseudo-cooperatives. I do not want your kollektiv of anything.

    We’ve had our community this whole time, and the corporate influence was supposed to be secondary. We had to fight and push to get OSS into the corporate sphere for decades, and sure enough, people take that for granted.

    There used to be an alignment where OSS was to “be against the corporate overlords of closed-source”, about “freedom of choice and implementation”, about “fair use licenses”. This was in opposition to Microsoft, Oracle, ATT unix, SCO unix, etc. But we all know what happened. The big names bought their way in, pumped up all the money-making schemes, and now we see this trite marketing where billion-dollar companies get to role-play like they care about the communities which WE built over the past three-plus decades.

    Don’t believe me? Go look at the recent Fedora website redesign. Look at Canonical. There are many examples, and one can find them easily by perusing the DistroWatch list of 100+ remakes.

    I’m in no way against corporate interest and benefit when it’s ethical, when it’s up-front and honest about its intentions. Though they are not my favorites in the world, at least Red Hat and Oracle don’t try to pull one over on the user with their main sites – it’s obvious that they are there for corporate players and for profitability, which is fine. I’ve paid for their licenses and their hardware many times in many ways, from personal use to license structures while operating a startup, so I’m not anti-corporation in the overall sense.

    However, this “we’re all in this together” brand of fake messaging is tiresome. There’s irony in Fedora being owned by Red Hat, which is owned by IBM; draw whatever conclusions one wants there, but they are all managed by different sectors.

    So then, do we really need to be sold on the community aspect of what is inherently a community of developers working together for a shared goal? No, we do not; those are supposed to be intrinsic and inherent qualities without need of mention.

    Who does it better? Here’s a non-exhaustive short list from the top of my mind.

    Don’t get me going on the BSDs… they are perfect and I love them, so it’s a totally different conversation. 💗