Planet NoName e.V.

2021-11-28

sECuREs website

MacBook Air M1: the best laptop?

You most likely have heard that Apple switched from Intel CPUs to their own, ARM-based CPUs.

Various early reviews touted the new MacBooks, among the first devices with the ARM-based M1 CPU, as the best computer ever. This got me curious: after years of not using any Macs, would an M1 Mac blow my mind?

In this article, I share my thoughts about the MacBook Air M1, after a year of occasional usage.

MacBook Air M1

Energy efficiency

The M1 CPU is remarkably energy-efficient. This has two notable effects:

  1. The device does not have a fan, and stays absolutely quiet. This is pretty magical, and I now notice my ThinkPad’s fan immediately.
  2. The battery lasts many hours, even with demanding use-cases like video conferencing.

When it comes to energy efficiency, Apple sets the bar. All other laptops should be fanless, too! And the battery life really is incredible: taking notes in Google Docs (via WiFi) while at a conference for many hours left me with well over 80% of battery at the end of the day!

I briefly lent the computer to someone and got it back with a VPN client installed. The battery life was considerably shortened by that VPN client and recovered once I uninstalled it. So if you’re not seeing great battery life, maybe a single program is ruining your experience.

The fast wakeup feature that was heavily stressed during the initial introduction (to some ridicule) is actually pretty nice! I now notice having to wait for my ThinkPad to wake up.

Battery life during standby is great, too. Anecdotally, when leaving my ThinkPad lying around, it never survives until I plug it in again. The MacBook survives every single time.

Chipset advantage?

Now, given that Apple controls the entire machine, does that mean they now offer features that other computers cannot offer yet?

My personal bar for this question is whether a computer can be used with my bandwidth-hungry 8K monitor, and the disappointing news is that the MacBook Air M1 cannot drive the 8K monitor with its 7680x4320 pixels resolution (at 60 Hz, using 2 DisplayPort links), not even with an external USB-C dock.
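To put a number on "bandwidth-hungry", here is a back-of-the-envelope sketch of the raw pixel data rate at this resolution. The article does not state the color depth or the DisplayPort version, so the 24 bits per pixel and the DisplayPort 1.4 HBR3 link parameters (4 lanes at 8.1 Gbit/s with 8b/10b encoding) are assumptions, and blanking overhead and compression are ignored.

```python
import math

# Display parameters from the article.
WIDTH, HEIGHT, REFRESH_HZ = 7680, 4320, 60

# Assumptions (not stated in the article): 24-bit color and a
# DisplayPort 1.4 HBR3 link with 4 lanes and 8b/10b line encoding.
BITS_PER_PIXEL = 24
HBR3_LANE_GBIT = 8.1
LANES_PER_LINK = 4
ENCODING_EFFICIENCY = 8 / 10


def pixel_data_rate_gbit(width, height, refresh_hz, bits_per_pixel):
    """Raw pixel data rate in Gbit/s, ignoring blanking intervals."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


def link_payload_gbit():
    """Usable payload of one assumed HBR3 link in Gbit/s."""
    return HBR3_LANE_GBIT * LANES_PER_LINK * ENCODING_EFFICIENCY


def links_needed(required_gbit, per_link_gbit):
    """Number of uncompressed links needed to carry the pixel data."""
    return math.ceil(required_gbit / per_link_gbit)


if __name__ == "__main__":
    required = pixel_data_rate_gbit(WIDTH, HEIGHT, REFRESH_HZ, BITS_PER_PIXEL)
    per_link = link_payload_gbit()
    print(f"pixel data: {required:.2f} Gbit/s, one link: {per_link:.2f} Gbit/s")
    print(f"links needed: {links_needed(required, per_link)}")
```

Under these assumptions a single link cannot carry the uncompressed signal, which is consistent with the monitor using 2 DisplayPort links.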

Maybe future hardware generations add support for 8K displays, but for my day-to-day, Apple’s complete control doesn’t improve anything.

Built-in peripherals

The screen is great! Everything looks sharp, colors are vibrant and brightness is good.

As usual, the touchpad (which Apple calls “trackpad”) is great, much better than any touchpad I have ever used on a PC laptop. Apple trackpads have had this advantage for as long as I have known them, and I don’t know why PC touchpads don’t seem to get any better. 🤔

Apple brought back their scissor mechanism keyboards, which is a very welcome change. I have witnessed so so many problems with the old butterfly mechanism keyboards.

This first MacBook Air M1 model has no MagSafe. Apple added MagSafe in the MacBook Pro M1 in late 2021. I hope they’ll eventually expand MagSafe to all notebooks.

Peripherals: not enough ports

Staying in peripheral-land, let me first state that this MacBook’s 2 USB-C ports are not enough!

When working on the go, after plugging in power, I can plug in a wired ethernet adapter (wireless can be spotty), but then won’t have any ports left for my ergonomic keyboard and mouse.

For video conferencing, I can plug in power (to ensure I won’t run out of battery), connect a table microphone, but won’t have any ports left for a decent webcam. This is particularly annoying because this MacBook’s built-in webcam is really bad, and the main reason why reviewers don’t give the MacBook a perfect score (example review on YouTube).

So, in practice, you need to carry a USB-C dock, or at least a USB hub, with your laptop when you anticipate possibly needing any peripherals. #donglelife

Not enough RAM for local software development

Hardware-wise, the biggest pain point for software developers is the small amount of RAM: both the MacBook Air M1 and the MacBook Pro M1 (13") can be configured with up to 16 GB of RAM. Only the newer MacBook Pro M1 14" or 16" (introduced late 2021) support more RAM.

To be clear, 16 GB RAM is enough to do software development in general, but it can quickly become limiting when you deal with larger programs or data sets.

In my ThinkPad, I have 64 GB of RAM, which allows for a lot more VMs, large index data structures, or just plenty of page cache. With the ThinkPad, I don’t have to worry about RAM.

Of course, there are strategies around this. Maybe your projects are large enough to warrant maintaining a remote build cluster, and you can run your test jobs in a staging environment. The MacBook makes for a fine thin client — provided your internet connection is fast and stable.

Operating System: macOS

I am talking about Operating Systems at a very high level in this section. Many use-cases will work fine, regardless of the Operating System one uses. I can typically get by with a browser and a terminal program.

So, this section isn’t a nuanced or fair review or critique of macOS or anything like that, just a collection of a few random things I found notable while playing with this device :)

My favorite way to install macOS is Internet Recovery. You can install a blank disk in your Mac and start the macOS installer via the internet! The Mac will even remember your WiFi password. The closest thing I know in the PC world is netboot.xyz, and that needs to be installed in your local network first.

Similarly, Apple’s integration when using multiple devices seems pretty good. For example, the Mac will offer to switch to your iPhone’s mobile connection when it loses network connectivity.

But, just like in all other operating systems, there is plenty in macOS to improve.

For example, software updates on the Mac still take 30 minutes (!) or so, which is entirely unacceptable for such a fast device! In particular, Apple seems to be (partially?) using immutable file system snapshots to distribute their software, so I don’t know why distri can install and update so much faster.

Speaking of Operating System shortcomings, I have observed how APFS (the Apple File System) can get into a state in which it cannot be repaired, which I found pretty concerning! Automated and frequent backups of all on-device data are definitely a must.

Slow software updates are annoying, and having little confidence in the file system makes me uneasy, but what’s really a dealbreaker is that my preferred keyboard layout does not work well on macOS: see Appendix A: NEO keyboard layout.

Linux? 🐧

So given my preference for Linux, could I just use Linux instead?

Unfortunately, while Asahi Linux is making great progress in bringing Linux to the M1 Macs, it seems like it’ll still be many months before I can install a Linux distribution and expect it to just work on the M1 Mac.

Until then, check out the Asahi Linux Progress Report blog posts!

Intel to M1 architecture transition

Apple developed the Rosetta 2 dynamic binary translator which transparently handles non-M1 programs, and so far it seems to work fine! All the things I tried just worked, and architecture never seemed to play a role during my usage.

Conclusion

The MacBook Air M1 is indeed impressive! It’s light, silent, fast and the battery life is amazing. If these points are the most important to you in a laptop, and you’re already in the Mac ecosystem, I imagine you’ll be very happy with this laptop.

But is the M1 really so mind-blowing that you should switch to it no matter what? No. As a long-time Linux user who is primarily developing software, I prefer my ThinkPad X1 Extreme with its plentiful peripheral connections and lots of RAM.

I know it’s not an entirely fair comparison: I should probably compare the ThinkPad to the newer MacBook Pro models (not MacBook Air). But I’m not a professional laptop reviewer, I can only speak about these 2 laptops that I found interesting enough to personally try.

Appendix A: NEO keyboard layout

The macOS implementation of the NEO keyboard layout has a number of significant incompatibilities/limitations: its layer 3 does not work correctly. Layer 3 contains many important common characters, such as / (Mod3 + i, i.e. Caps Lock + i) or ? (Mod3 + s).

I installed the current neo.keylayout file (2019-08-16) as described on the NEO download page.

In order to make / and ? work in Google Docs, I had to enable the additional Karabiner rule “Prevent all layer 3 keys from being treated as option key shortcut” (see also: this GitHub issue)


I encountered the following issues, ordered by severity:

Issue 1: I cannot use Emacs at all! I installed the emacsformacosx.com version (also tried homebrew), but cannot enter keys such as / or ?. Emacs interprets these as M-u instead.

The Karabiner rule “Prevent all layer 3 keys from being treated as option key shortcut” that fixed this issue in Google Docs does not help for Emacs. Removing it from Karabiner changes behavior, but Emacs still recognizes M-i instead of /, so it’s broken with or without the rule.

Issue 2: In the Terminal app, I cannot enable the “Use Option as Meta key” keyboard option, otherwise all layer 3 keys function as meta shortcuts (M-i) instead of key symbols (/).

I commonly use the Meta key to jump around word-wise: Alt+b / Alt+f on a PC. Since I can’t use Option + b / Option + f on a Mac, I need to use Option + arrow keys instead, which works.

Since the Option key does not work as Meta key, I need to press (and release!) the Escape key instead. This is pretty inconvenient in Emacs in a terminal.

Issue 3: In Gmail in Chrome, the search keyboard shortcut (/) is not recognized.

I reported this problem upstream, but there seems to be no solution.


I’m not sure why these programs don’t work well with NEO. I tried BBEdit for comparison, and it had no trouble with (macOS-level) shortcuts such as command + / and option + command + /.

On Linux, the NEO layout works so much better. I’m really not in the mood to continuously fight with my operating system over keyboard input and shortcuts.

at 2021-11-28 15:50

2021-11-20

RaumZeitLabor

Habemus hygiene concept

Just under a year after moving in, we have decided to open the new RZL to visitors.

Unfortunately not with a thrilling housewarming party, but we will make up for that as soon as we can. Promise!

You can find our current hygiene concept, which we will adjust if necessary, here.

In short: 2G, proof required, check-in via CWA; a self-test before visiting and wearing a mask are recommended.

Please do not come if you feel ill or know of a possible contact with a COVID-positive person! Protect yourselves and us!

Additional rules may apply at events. It is best to check the website under Events right before your visit, or to contact the board if anything is unclear.

If you want to come to the RZL for the first time, Tuesday evening with the “Offene RaumZeitLaborierung” is once again the best choice. To avoid standing in front of a locked door, you should still sign up briefly in advance if possible.

by flederrattie at 2021-11-20 00:00

2021-11-03

michael-herbst.com

Surrogate models for quantum spin systems based on reduced order modeling

The simulation of quantum spin models is an actively researched field. Albeit rather basic, these many-body systems are inherently strongly correlated and as such feature a rich variety of phenomena, including involved patterns of ordering / disordering, topological order or varieties of phase changes. Furthermore, these models often provide a good approximation to the low-temperature regime of real physical systems, justifying their detailed study. One approach is to consider parametrised quantum spin models as a low-complexity proxy for real systems and use them to understand which parameter values (e.g. which spin coupling strengths) lead to interesting behaviours. From this one can deduce inversely how novel materials ought to be designed in order to probe and study these behaviours experimentally.

In a recent work my mentor Benjamin Stamm and I teamed up with Stefan Wessel (RWTH physics department) and Matteo Rizzi (Universität Köln, Forschungszentrum Jülich) to work on cheap surrogate models for accelerating the study of such parametrised quantum spin models. Our key assumption is that the Hamiltonian of these models as well as the deduced quantities of interest (e.g. the structure factor) can be decomposed affinely in the parameters. For many standard models this is indeed the case. Exploiting the affine structure of the Hamiltonian, our approach constructs a reduced-basis surrogate, which effectively represents the full problem in a basis of the exact solutions at a carefully chosen set of parameter values. As we demonstrate for two examples (a chain of Rydberg atoms as well as a sheet of coupled triangles), relatively small reduced bases, which are orders of magnitude smaller than the dimensionality of the Hilbert space, accumulate sufficient information to reproduce key quantities of interest over the full parameter domain to an absolute error of 10⁻⁴ or less.
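The reduced-basis idea described above can be illustrated with a toy sketch in Python. This is not the code from the paper: it assumes a one-parameter affine Hamiltonian H(mu) = H0 + mu * H1 built from small random symmetric matrices, and it takes snapshots on a fixed parameter grid instead of selecting them with a greedy strategy. All function names are made up for this illustration.

```python
import numpy as np


def random_symmetric(n, rng):
    """Random real symmetric matrix standing in for a Hamiltonian term (toy data)."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / 2


def ground_state(h):
    """Lowest eigenvalue and corresponding eigenvector of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(h)
    return vals[0], vecs[:, 0]


def build_reduced_basis(h0, h1, snapshot_params):
    """Orthonormal basis spanning the ground states at the snapshot parameters."""
    snapshots = [ground_state(h0 + mu * h1)[1] for mu in snapshot_params]
    basis, _ = np.linalg.qr(np.column_stack(snapshots))
    return basis


def project(h, basis):
    """Project a full-size operator onto the reduced basis (done once, offline)."""
    return basis.T @ h @ basis


def reduced_ground_energy(h0_red, h1_red, mu):
    """Ground-state energy of the small projected problem (cheap, online)."""
    return np.linalg.eigvalsh(h0_red + mu * h1_red)[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h0, h1 = random_symmetric(200, rng), random_symmetric(200, rng)
    basis = build_reduced_basis(h0, h1, np.linspace(0.0, 1.0, 5))
    # Thanks to the affine structure, only the two terms need projecting.
    h0_red, h1_red = project(h0, basis), project(h1, basis)
    for mu in (0.25, 0.4):
        exact = ground_state(h0 + mu * h1)[0]
        approx = reduced_ground_energy(h0_red, h1_red, mu)
        print(f"mu={mu}: exact={exact:.6f} reduced={approx:.6f}")
```

Because the Hamiltonian is affine in the parameter, the projections of H0 and H1 are computed once and every further parameter value only requires a small eigenproblem. At a snapshot parameter the reduced energy matches the full one; elsewhere it is an upper bound whose quality depends on how well the snapshots were chosen.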

For me this was the first time working with quantum spin models, and I particularly enjoyed this interdisciplinary collaboration and diving into a new subject through the discussions we had. While working on this paper we identified a number of possibilities for future work. In fact, a number of the problems typically encountered when numerically modelling quantum spin models (e.g. due to highly degenerate ground states or issues with the iterative eigensolvers) are closely related to the challenges of modelling difficult quantum-chemical systems.

The full abstract of our paper reads:

We present a methodology to investigate phase-diagrams of quantum models based on the principle of the reduced basis method (RBM). The RBM is built from a few ground-state snapshots, i.e., lowest eigenvectors of the full system Hamiltonian computed at well-chosen points in the parameter space of interest. We put forward a greedy-strategy to assemble such small-dimensional basis, i.e., to select where to spend the numerical effort needed for the snapshots. Once the RBM is assembled, physical observables required for mapping out the phase-diagram (e.g., structure factors) can be computed for any parameter value with a modest computational complexity, considerably lower than the one associated to the underlying Hilbert space dimension. We benchmark the method in two test cases, a chain of excited Rydberg atoms and a geometrically frustrated antiferromagnetic two-dimensional lattice model, and illustrate the accuracy of the approach. In particular, we find that the ground-state manifold can be approximated to sufficient accuracy with a moderate number of basis functions, which increases very mildly when the number of microscopic constituents grows --- in stark contrast to the exponential growth of the Hilbert space needed to describe each of the few snapshots. A combination of the presented RBM approach with other numerical techniques circumventing even the latter big cost, e.g., Tensor Network methods, is a tantalising outlook of this work.

by Michael F. Herbst at 2021-11-03 23:30 under Publications, reduced basis, quantum spin systems, strong correlation

2021-11-02

michael-herbst.com

Quantum Chemistry Common Driver and Databases (QCDB) and Quantum Chemistry Engine (QCEngine): Automation and Interoperability among Computational Chemistry Programs

As part of my previous work on the adcc code for computational spectroscopy based on the algebraic-diagrammatic construction (ADC), we also integrated the package with QCEngine. This package aims at integrating different quantum-chemistry codes under a common interface for end users, which is an effort I fully support. Recently the design and structure of QCEngine and the related QCDB packages have been summarised in a publication. Its full abstract reads:

Community efforts in the computational molecular sciences (CMS) are evolving toward modular, open, and interoperable interfaces that work with existing community codes to provide more functionality and composability than could be achieved with a single program. The Quantum Chemistry Common Driver and Databases (QCDB) project provides such capability through an application programming interface (API) that facilitates interoperability across multiple quantum chemistry software packages. In tandem with the Molecular Sciences Software Institute and their Quantum Chemistry Archive ecosystem, the unique functionalities of several CMS programs are integrated, including CFOUR, GAMESS, NWChem, OpenMM, Psi4, Qcore, TeraChem, and Turbomole, to provide common computational functions, i.e., energy, gradient, and Hessian computations as well as molecular properties such as atomic charges and vibrational frequency analysis. Both standard users and power users benefit from adopting these APIs as they lower the language barrier of input styles and enable a standard layout of variables and data. These designs allow end-to-end interoperable programming of complex computations and provide best practices options by default.

by Michael F. Herbst at 2021-11-02 23:30 under Publications, electronic structure theory, theoretical chemistry, adcc, algebraic-diagrammatic construction

2021-10-14

michael-herbst.com

A robust and efficient line search for self-consistent field iterations

In an ongoing effort with Antoine Levitt our aim is to develop reliable density-functional theory (DFT) methods for computational materials design. Recently we looked into a strategy to automatically select the damping parameter for the self-consistent field iterations (SCF). Our adaptive damping approach is based on a theoretically sound quadratic model for the DFT energy, which is used to fix the step size (damping) adaptively along the search directions suggested by an underlying algorithm (such as Pulay mixing, Kerker mixing, etc.). Our algorithm is fully automatic, i.e. an a priori damping selection is no longer required. In our work we test our method successfully on a range of challenging systems including supercells, transition-metal alloys or metallic surfaces. Overall our study shows adaptive damping to provide superior robustness over the traditional fixed-damping approach.

As I have reported in previous blog articles, and as we discussed in our previous publication on black-box mixing strategies for inhomogeneous systems, the main motivation of our work is to design numerical methods that are parameter-free and automatically self-adapt to the simulated material. In modern simulation scenarios where millions of DFT calculations are required in order to generate training data or screen over large design spaces, robustness and automation are the key requirements. Often it is in fact less the computational time of the individual calculations that limits overall throughput. Rather, it is the human factor, i.e. the human time required to set up, check and verify computations.

Clearly, at the level of millions of calculations, computational parameters can no longer be selected manually. Instead elaborate heuristics are employed to select basis set size, k-point sampling, SCF algorithm or the damping parameter. In case a calculation fails, heuristics are also employed for automatic restart. However, this approach is far from perfect, and even an optimistic 1% failure rate easily equals thousands of calculations that require human attention. With our work (both the previous paper and this one) we want to replace heuristic approaches to parameter selection by algorithms that employ a mixture of mathematical and physical insight to automatically adapt to the simulation at hand. As we demonstrate in this work, such algorithms might be associated with an increased effort compared to the best possible parameter setting, but they also make calculations more robust overall. Therefore one (a) saves the repeated effort of finding a suitable parameter set by trial and error and (b) reduces the fraction of calculations that need to be considered by a human. Overall the maximally attainable throughput can therefore be expected to increase with such a robust scheme, despite the fact that an individual calculation might be more costly.

In this work in particular we considered the question of choosing the damping parameter. Our adaptive damping approach is based on constructing an approximate quadratic model for the DFT energy and using this model within a line search procedure. Since this procedure is associated with an additional cost, we only employ it in case the proposed SCF step would increase either the DFT energy or the SCF residual. Notably, our approach introduces no changes to the SCF in case each SCF step proposed by the mixing procedure is already perfect (i.e. energy or residual decreasing). Therefore adaptive damping can be considered a safeguard, which only comes into play if the proposed steps are noisy or erroneous. Adaptive damping is by construction orthogonal to any existing mixing and convergence acceleration technique for DFT methods, and in our work we demonstrate that it integrates readily into an Anderson-accelerated SCF for various challenging systems. Overall we managed to increase performance and robustness at only a minor extra cost. The full abstract of our paper reads:

We propose a novel adaptive damping algorithm for the self-consistent field (SCF) iterations of Kohn-Sham density-functional theory, using a backtracking line search to automatically adjust the damping in each SCF step. This line search is based on a theoretically sound, accurate and inexpensive model for the energy as a function of the damping parameter. In contrast to usual SCF schemes, the resulting algorithm is fully automatic and does not require the user to select a damping. We successfully apply it to a wide range of challenging systems, including elongated supercells, surfaces and transition-metal alloys.
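To give a feeling for how a quadratic model can pick a damping, here is a generic textbook-style sketch in Python. It is not the algorithm from the paper: it assumes the energy is a plain function of a list of floats, fits a one-dimensional quadratic through the current energy, the slope along the search direction and one trial energy, and only falls back to the model if the trial step increases the energy (the residual criterion mentioned above is left out). All names and default values are hypothetical.

```python
def quadratic_model_step(e0, slope0, alpha_trial, e_trial,
                         alpha_min=0.01, alpha_max=1.0):
    """Damping that minimises the quadratic through E(0) = e0, E'(0) = slope0
    and E(alpha_trial) = e_trial, clamped to [alpha_min, alpha_max]."""
    curvature = 2.0 * (e_trial - e0 - slope0 * alpha_trial) / alpha_trial**2
    if curvature <= 0.0:
        # The model has no minimum: take the largest step along a descent
        # direction, otherwise the smallest allowed one.
        return alpha_max if slope0 < 0.0 else alpha_min
    alpha_star = -slope0 / curvature
    return min(max(alpha_star, alpha_min), alpha_max)


def damped_step(energy, x, direction, slope0, alpha_trial=0.8):
    """Take the trial step if it lowers the energy; otherwise use the damping
    suggested by the quadratic model. Returns the new point and the damping."""
    e0 = energy(x)
    x_trial = [xi + alpha_trial * di for xi, di in zip(x, direction)]
    e_trial = energy(x_trial)
    if e_trial < e0:
        return x_trial, alpha_trial  # safeguard not needed
    alpha = quadratic_model_step(e0, slope0, alpha_trial, e_trial)
    return [xi + alpha * di for xi, di in zip(x, direction)], alpha


if __name__ == "__main__":
    # Toy energy with its minimum at x = 1; the proposed direction overshoots.
    def energy(x):
        return (x[0] - 1.0) ** 2

    x_new, alpha = damped_step(energy, [0.0], [10.0], slope0=-20.0)
    print(f"damping={alpha:.3f}, new x={x_new[0]:.3f}")
```

The point of the design is that a well-behaved step costs nothing extra, while a step that would overshoot is shortened based on information that has already been computed.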

by Michael F. Herbst at 2021-10-14 22:30 under Publications, electronic structure theory, theoretical chemistry, DFTK, Julia, DFT, numerical analysis, Kohn-Sham

2021-08-28

michael-herbst.com

Q-Chem 5 paper

About two years ago I integrated my open-source ctx library into the Q-Chem quantum-chemistry software suite. Quickly ctx became part of the core stack for managing computational results inside Q-Chem. In particular inside the ccman and adcman modules, which are responsible for most of the coupled-cluster and algebraic-diagrammatic construction methods available in Q-Chem, ctx is widely used.

In a recently published paper by all the Q-Chem authors, the developments inside the Q-Chem package leading up to major version 5 of the software are now summarised. The full abstract reads:

This article summarizes technical advances contained in the fifth major release of the Q-Chem quantum chemistry program package, covering developments since 2015. A comprehensive library of exchange-correlation functionals, along with a suite of correlated many-body methods, continues to be a hallmark of the Q-Chem software. The many-body methods include novel variants of both coupled-cluster and configuration-interaction approaches along with methods based on the algebraic diagrammatic construction and variational reduced density-matrix methods. Methods highlighted in Q-Chem 5 include a suite of tools for modeling core-level spectroscopy, methods for describing metastable resonances, methods for computing vibronic spectra, the nuclear–electronic orbital method, and several different energy decomposition analysis techniques. High-performance capabilities including multithreaded parallelism and support for calculations on graphics processing units are described. Q-Chem boasts a community of well over 100 active academic developers, and the continuing evolution of the software is supported by an "open teamware" model and an increasingly modular design.

by Michael F. Herbst at 2021-08-28 22:30 under Publications, electronic structure theory, theoretical chemistry

2021-08-28

sECuREs website

Silent HP Z440 workstation: replacing noisy fans

Since March 2020, I have been using my work computer at home: an HP Z440 workstation.

When I originally took the machine home, I immediately noticed that it’s quite a bit louder than my other PCs, but only now did I finally decide to investigate what I could do about it.

Finding all the fans

I first identified all fans, both by opening the chassis and looking around, and by looking at the HP Z440 Maintenance and Service Guide, which contains this description:

chassis components

Specifically, I identified the following fans:

  • “1 Fan”, a 92mm rear fan, sucking air out of the back of the chassis.
  • “5 Memory fans”, two 60mm fans in a custom HP plastic enclosure that are positioned directly above the DIMM slots to the left and right of the CPU.
  • “6 CPU Heat sink”, a 92mm fan on top of a heat sink
  • “11 Rear System Fan”, a 92mm front (!) fan, pulling air into the front of the chassis.
  • My aftermarket nVidia GeForce GPU has 3 fans on a massive heat sink.
  • The power supply has a fan, too, which I will not touch.

Memory fans

The Z440 comes with a custom HP plastic enclosure that is put over the CPU cooler, fastened with two clips at opposite ends, and positions two small 60mm fans above the DIMM banks.

This memory fan plastic enclosure is a pain to find anywhere. It looks like HP is no longer producing it.

The enclosure plugs into the mainboard with a custom connector that is directly wired up to the fans, meaning it’s a pain to replace the fans.

memory fans

Luckily, while shopping around for an enclosure I could modify, I realized that memory fans are only required when installing more than 4 DIMM modules!

My machine “only” has 64 GB of RAM, in 4 DIMM modules, and I don’t intend to upgrade anytime soon, so I just unplugged the whole memory fan enclosure and removed it from the chassis.

The UEFI firmware does not complain about the memory fans missing (contrary to the rear fan!), and this simple change alone makes a noticeable difference in noise levels.

GPU fans

nVidia GPUs can be run at different “PowerMizer” performance levels:

nVidia PowerMizer

Many years ago, I ran into lag when using Chrome that went away as soon as I switched my nVidia GPU’s Preferred Mode to “Prefer Maximum Performance” instead of “Auto” or “Adaptive mode”.

It turns out that nowadays, that is no longer a problem, so running at Prefer Maximum Performance is no longer necessary.

Worse, pinning the GPU at the highest Performance Level means that it produces more heat, resulting in the fans having to spin up more often, and run for longer durations.

But, even after switching to Auto, resulting in Adaptive mode being chosen, I noticed that my GPU was stuck at a higher PowerMizer level than I thought it should be.

An easy fix is to limit the GPU to a certain PowerMizer level, and ideally not the lowest level (level 0). For me, one level after that (level 1) seems to result in no slow-down during my typical usage.

I followed this blog post to limit my GPU to PowerMizer level 1, i.e. I added /etc/modprobe.d/nvidia-power-save.conf with the following contents:

options nvidia NVreg_RegistryDwords="OverrideMaxPerf=0x2"

…followed by a rebuild of my initramfs (update-initramfs -u) and a reboot.

This way, the fans don’t typically need to spin up as the GPU stays below its temperature limit.

Rear and front fans

With the memory fans and GPU fans out of the way, two easy to check fans remain: the rear fan and front fan. These are 92mm in size, the model number is Foxconn PVA092G12S.

rear fan

I unplugged both of them to see what effect these fans have on the noise level, and the difference was significant!

Unfortunately, unplugging isn’t enough: the UEFI firmware complains on boot when the rear fan is not connected, requiring you to press Enter to boot. Also, the machine seems to get a few degrees Celsius hotter inside without the front and rear fans, so I don’t want to run the machine without these fans for an extended period of time.

I ordered two Noctua NF-A9x14 PWM fans (for about 25 CHF each) to replace the stock front and rear fans.

Unfortunately, HP uses a custom 4-pin fan connector on its Z440 mainboard! Luckily, modifying the connector of the Noctua Low-Noise Adapter cable to fit on the custom 4-pin connector is as simple as using a knife to remove the connector’s guard rails:

fan connector mod

CPU fan

For the CPU fan, HP again chose to use a custom (6-pin) connector.

On the web, I read that the Z440 CPU fan is quite efficient and not worth replacing. This matches my experience, so I kept the standard Z440 CPU cooler.

Conclusion

I was quite happy to discover that I could just unplug the memory fans, and configure my GPU to make less noise. Together with replacing the front/rear fans with Noctua ones, the machine is much quieter now than before!

One downside of workstation-class hardware is that manufacturers (at least HP) like to build custom parts and solutions. Using their own fan connectors instead of standard connectors is such a pain! I’ll be sure to stick to standard PC hardware :)

at 2021-08-28 13:16

2021-08-04

michael-herbst.com

JuliaCon BoF discussion session: Building a Chemistry and Materials Science Ecosystem

The second event I co-organised at this year's JuliaCon (see this article for the other) was a Birds of a Feather (BoF) discussion session titled Building a Chemistry and Materials Science Ecosystem in Julia. In this session Rachel Kurchin and I wanted to gather the various stakeholders working on Julia codes for chemistry and materials simulations, discuss possible overlaps and plan future joint efforts.

This was the first time a meeting dedicated to this scientific field was conducted within the Julia community, and so we were quite curious about who would turn up. In the end we had a pretty mixed crowd consisting of Julia users tackling research problems in chemistry and materials as well as plenty of maintainers of various Julia packages related to the field, and some veteran Julia users joined the discussion too. This mix of people provoked a rather rich and lively debate about the perspectives of Julia in this field, and the 90 minutes which were given to us passed almost in an instant.

A central discussion point within the session was the need for joint interfaces shared amongst the key packages of the ecosystem, both to leverage Julia's unique composability between the various packages and to enhance interoperability and provide a good user experience. As many pointed out during the session, a good first step is the design of an interface for representing the structure of the chemical system or the material to be studied. In particular this would allow developing unified approaches to share data between packages, set up calculations and plainly compare different approaches. Additionally, annoying aspects such as file parsing, data export, plotting or other post-processing could then be implemented once using the general interface and used by everyone in the Julia community. Naturally a time slot of 90 minutes is just about sufficient to get the discussion started and scratch the surface, so the session has not yet yielded anything conclusive. However, following up from the conference the debate has definitely intensified amongst participants, and I would not be surprised if some progress is made.

In case you are interested in participating in these developments or plainly want to get in touch with Julia users and developers from chemistry, molecular or materials science, here are a number of relevant resources:

by Michael F. Herbst at 2021-08-04 10:00 under Research, workshop, electronic structure theory, Julia

2021-08-03

michael-herbst.com

JuliaCon DFTK workshop: A mathematical look at electronic structure theory

From 13th July till 30th July this year's JuliaCon finally took place virtually. The first week (13th till 27th) hosted a number of three-hour live-streamed sessions of workshops, while the "regular" conference with a number of prerecorded talks started on 28th.

After my introductory talk on electronic structure theory and our DFTK code at last year's JuliaCon, this year I participated in the conference with two events. The first was a BoF discussion session gathering the people working on materials-science and electronic-structure codes in Julia, about which I will write some more in a follow-up blog article.

My second event was a three-hour workshop titled A mathematical look at electronic structure theory in which I prepared a broadly accessible introduction to density-functional theory (DFT), the numerical procedures to solve DFT as well as some tools from numerical analysis to understand the convergence properties of these methods. As the tool to conduct the relevant calculations, code up and study the respective self-consistent field (SCF) algorithms we used our density-functional toolkit (DFTK). The workshop therefore also provides a great showcase for the merits of this code and how it leverages the broader Julia ecosystem to gain its unique features (arbitrary floating-point types, flexible and composable algorithms, automatic differentiation, numerical analysis techniques to investigate convergence failures, etc.). For more details on the workshop see the dedicated teaching page.

What pleasantly surprised me while conducting the workshop was the large number of viewers who followed it live and actively engaged by asking questions or posting comments on YouTube. Since the workshop was hosted at JuliaCon I wouldn't have thought this topic would capture this many people, so in retrospect I am very happy I did it. In that sense also a big thanks to everyone who participated and provided me with feedback afterwards. (BTW: I'm still happy to take any in case you have some comments or suggestions).

In case you missed the workshop, the complete materials are available on GitHub and the full recording of the workshop is available on YouTube.

by Michael F. Herbst at 2021-08-03 10:00 under Research, workshop, electronic structure theory, Kohn-Sham, high-throughput, DFT, DFTK, solid state, Julia

2021-08-01

michael-herbst.com

Virtual materials design 2021: Black-box density-functional theory methods

On 20th and 21st July 2021 the Virtual Materials Design 2021 CECAM workshop took place virtually. I was excited about this workshop and the opportunity to get in touch with researchers working on high-throughput computational materials design. While I am not actively working in this field the special requirements of the multitude of calculations running in this field clearly have been one of the main motivations for my work on DFTK, error control and black-box SCF algorithms. In advance of the workshop I asked the organisers to participate with a contributed talk to present my work to this community for the first time, which thankfully got accepted.

Due to the virtual format the workshop was unfortunately rather packed, which allowed for little time to engage in discussion during the presentation slot. However, the organisers arranged multiple longer poster sessions in a GatherTown virtual world, which allowed for almost realistic face-to-face discussions. In these GatherTown sessions I talked with a number of scientists working on high-throughput studies as well as designing the large software infrastructures, which are commonly used to conduct these. At the level of performing millions of individual calculations in a screening study this naturally poses special demands on the workflow software as well and I was curious to learn about some of the details.

With my focus on advocating a more mathematical look at screening and DFT simulations I represented a minority viewpoint at the meeting and I was very curious about the general feedback and critique of the more applied scientists in response to our recently proposed ideas. In general people were indeed quite interested to learn about our work on reliable SCF methods for inhomogeneous systems, but being confronted with our recent error estimation perspectives, some had doubts about the required effort being really worth it for DFT simulations. I certainly understand that concern. However, I think one should keep in mind the successes and potential that have been unlocked by error estimation techniques in other fields, such as finite-element modelling or aerospace design. In these fields simulation methods have become more efficient due to the lessons learned from uncertainty quantification and error estimation, and the nowadays well-established error estimation techniques have furthermore helped to prevent accidents caused by trusting faulty simulation data (such as the Sleipner A oil rig collapse). While clearly not all aspects of macroscopic modelling apply in the microscopic world, it is not hard to imagine that error bars establishing a guaranteed trustworthiness can make screening decisions more robust, thus potentially preventing costly manufacture of less useful compounds. Furthermore I expect a careful introduction of numerical errors (e.g. by lowering the floating-point type) to balance numerical error against the (usually much larger) DFT model error to allow for notable computational savings when performing on the order of millions of DFT calculations.

Overall I have enjoyed the two afternoons with many discussions in the high-throughput design community. As usual my slides are attached below.

Link
Towards error-controlled, black-box density-functional theory methods (Slides)

by Michael F. Herbst at 2021-08-01 10:00 under Research, talk, electronic structure theory, Kohn-Sham, high-throughput, DFT, DFTK, solid state

2021-07-15

michael-herbst.com

SSD Seminar: Accelerating the discovery of tomorrow's materials by robust and error-controlled simulations

A couple of days ago, on 12th July, I was invited to present my research in the SSD Seminar Series of RWTH Aachen. Being part of the research training group on modern inverse problems as well as the School for Simulation and Data Science (SSD) the SSD seminars are interdisciplinary and feature researchers as well as Master-level students from a couple of departments at RWTH (mathematics, computer science, simulation sciences, ...).

To make my recent work on error estimation and the design of robust algorithms for density-functional theory broadly accessible I started by motivating the need for density-functional theory (DFT) and high-throughput methods for the discovery and design of novel materials. Afterwards I briefly hinted at the mathematical structure of the equations, which need to be solved to obtain DFT properties. With this in mind I presented current research questions at the edge of mathematics and electronic-structure modelling and presented some of my recent results. As usual the slides are attached below.

Link
Accelerating the discovery of tomorrow's materials by robust and error-controlled electronic-structure simulations (Slides)

by Michael F. Herbst at 2021-07-15 10:01 under Research, talk, electronic structure theory, Kohn-Sham, high-throughput, DFT, solid state

2021-07-15

michael-herbst.com

Talk at many-body seminar at RWTH

On 29th June I was invited to present a short summary of my research at the seminar of the research training group Quantum Many-Body Methods at RWTH Aachen University. In the talk I gave an overview of my ongoing work on reliable black-box self-consistent field schemes for high-throughput DFT calculations. My slides are attached below.

Link
Reliable black-box self-consistent field schemes for high-throughput DFT calculations (Slides)

by Michael F. Herbst at 2021-07-15 10:00 under Research, talk, electronic structure theory, Kohn-Sham, high-throughput, DFT, solid state

2021-07-10

sECuREs website

25 Gigabit Linux internet router PC build

init7 recently announced that with their FTTH fiber offering Fiber7, they will now sell and connect you with 25 Gbit/s (Fiber7-X2) or 10 Gbit/s (Fiber7-X) fiber optics, if you want more than 1 Gbit/s.

While this offer will only become available at my location late this year (or possibly later due to the supply chain shortage), I already wanted to get the hardware on my end sorted out.

After my previous disappointment with the MikroTik CCR2004, I decided to try a custom PC build.

An alternative to many specialized devices, including routers, is to use a PC with an expansion card. An internet router’s job is to configure a network connection and forward network packets. So, in our case, we’ll build a PC and install some network expansion cards!

router PC build

Goals

For this PC internet router build, I had the following goals, highest priority to lowest priority:

  1. Enough performance to saturate 25 Gbit/s, e.g. with two 10 Gbit/s downloads.
  2. Silent: no loud fan noise.
  3. Power-efficient: low power usage, as little heat as possible.
  4. Low cost (well, for a high-end networking build…).

Network Port Plan

The simplest internet router has 2 network connections: one uplink to the internet, and the local network. You can build a router without extra cards by using a mainboard with 2 network ports.

Because there are no mainboards with SFP28 slots (for 25 Gbit/s SFP28 fiber modules), we need at least 1 network card for our build. You might be able to get by with a dual-port SFP28 network card if you have an SFP28-compatible network switch already, or need just one fast connection.

I want to connect a few fast devices (directly and via fiber) to my router, so I’m using 2 network cards: an SFP28 network card for the uplink, and a quad-port 10G SFP+ network card for the local network (LAN). This leaves us with the following network ports and connections:

Network Card   max speed    cable   effective    Connection
Intel XXV710   25 Gbit/s    fiber   25 Gbit/s    Fiber7-X2 uplink
Intel XXV710   25 Gbit/s    DAC     10 Gbit/s    workstation
Intel XL710    10 Gbit/s    RJ45    1 Gbit/s     rest (RJ45 Gigabit)
Intel XL710    10 Gbit/s    fiber   10 Gbit/s    MikroTik 1
Intel XL710    10 Gbit/s    fiber   10 Gbit/s    MikroTik 2
Intel XL710    10 Gbit/s    /       10 Gbit/s    (unused)
onboard        2.5 Gbit/s   RJ45    1 Gbit/s     (management)
network connectors

Hardware selection

Now that we have defined the goals and network needs, let’s select the actual hardware!

Network Cards

My favorite store for 10 Gbit/s+ network equipment is FS.COM. They offer Intel-based cards:

Network cards

Both cards work out of the box with the i40e Linux kernel driver, no firmware blobs required.

For a good overview of the different available Intel cards, check out the second page (“Product View”) in the card’s User Manual.

CPU and Chipset

I read on many different sites that AMD’s current CPUs beat Intel’s CPUs in terms of performance per watt. We can better achieve goals 2 and 3 (low noise and low power usage) by using fewer watts, so we’ll pick an AMD CPU and mainboard for this build.

AMD’s current CPU generation is Zen 3, and current Zen 3 based CPUs can be divided into 65W TDP (Thermal Design Power) and 105W TDP models. Only one 65W model is available to customers right now: the Ryzen 5 5600X.

Mainboards are built for/with a certain so-called chipset. Zen 3 CPUs use the AM4 socket, for which 8 different chipsets exist. Our network cards need PCIe 3.0, so that disqualifies 5 chipsets right away: only the A520, B550 and X570 chipsets remain.

Ryzen 5

Mainboard: PCIe bandwidth

I originally tried using the ASUS PRIME X570-P mainboard, but I ran into two problems:

Too loud: X570 mainboards need an annoyingly loud chipset fan for their 15W TDP. Other chipsets such as the B550 don’t need a fan for their 5W TDP. With a loud chipset fan, goal 2 (low noise) cannot be achieved. Only the recently-released X570S variant comes without fans.

Not enough PCIe bandwidth/slots! This is how the ASUS tech specs describe the slots:

This means the board has 2 slots (1 CPU, 1 chipset) that are physically wide enough to hold a full-length x16 card, but only the first port can electronically be used as an x16 slot. The other port only has PCIe lanes electronically connected for x4, hence “x16 (max at x4 mode)”.

Unfortunately, our network cards need electrical connection of all their PCIe x8 lanes to run at full speed. Perhaps Intel/FS.COM will one day offer a new generation of network cards that use PCIe 4.0, because PCIe 4.0 x4 achieves the same 7.877 GB/s throughput as PCIe 3.0 x8. Until then, I needed to find a new mainboard.
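The equivalence follows from the per-lane signalling rates. As a rough sketch, assuming the commonly published rates of 8 GT/s (PCIe 3.0) and 16 GT/s (PCIe 4.0) per lane and 128b/130b line encoding for both generations (the helper name pcie_gbytes is made up for illustration):

```shell
# pcie_gbytes GT_PER_S LANES: usable throughput in GB/s, assuming 128b/130b encoding
pcie_gbytes() {
    awk -v gt="$1" -v lanes="$2" 'BEGIN {
        # GT/s * (128 payload bits per 130 transferred bits) / 8 bits per byte * lanes
        printf "%.3f\n", gt * 128 / 130 / 8 * lanes
    }'
}

pcie_gbytes 8 8     # PCIe 3.0 x8 -> 7.877
pcie_gbytes 16 4    # PCIe 4.0 x4 -> 7.877
pcie_gbytes 8 4     # PCIe 3.0 x4 -> 3.938
```

The last line shows why the “x16 (max at x4 mode)” slot is a problem: a PCIe 3.0 card in an x4 slot gets only half the bandwidth it was designed for.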

Searching mainboards by PCIe capabilities is rather tedious, as mainboard block diagrams or PCIe tree diagrams are not consistently available from all mainboard vendors.

Instead, we can look explicitly for a feature called PCIe Bifurcation. In a nutshell, PCIe bifurcation lets us divide the PCIe bandwidth from the Ryzen CPU from 1 PCIe 4.0 x16 into 1 PCIe 4.0 x8 + 1 PCIe 4.0 x8, definitely satisfying our requirement for two x8 slots at full bandwidth.

I found a list of (only!) three B550 mainboards supporting PCIe Bifurcation in an Anandtech review. Two are made by Gigabyte, one by ASRock. I read the Gigabyte UEFI setup is rather odd, so I went with the ASRock B550 Taichi mainboard.

Case

For the case, I needed a midi case (large enough for the B550 mainboard’s ATX form factor) with plenty of options for large, slow-spinning fans.

I stumbled upon the Corsair 4000D Airflow, which is available for 80 CHF and achieved positive reviews. I’m pleased with the 4000D: there are no sharp corners, installation is quick, easy and clean, and the front and top panels offer plenty of space for cooling behind large air intakes:

Airflow case (from the top)

Inside, the case offers plenty of space and options for routing cables on the back side:

Airflow case (back)

Which in turn makes for a clean front side:

Airflow case (front)

Fans

I have been happy with Noctua fans for many years. In this build, I’m using only Noctua fans so that I can reach goal 2 (silent, no loud fan noise):

Noctua fans

These fans are large (140mm), so they can spin at low speeds and still be effective.

The specific fan configuration I ended up with:

  • 1 Noctua NF-A14 PWM 140mm in the front, pulling air out of the case
  • 1 Noctua NF-A14 PWM 140mm in the top, pulling air into the case
  • 1 Noctua NF-A12x25 PWM 120mm in the back, pulling air into the case
  • 1 Noctua NH-L12S CPU fan

Note that this is most likely overkill: I can well imagine that I could turn off one of these fans entirely without a noticeable effect on temperatures. But I wanted to be on the safe side and have a lot of cooling capacity, as I don’t know how hot the Intel network cards run in practice.

Fan Controller

The ASRock B550 Taichi comes with a Nuvoton NCT6683D-T fan controller.

Unfortunately, ASRock seems to have set the Customer ID register to 0 instead of CUSTOMER_ID_ASROCK, so you need to load the nct6683 Linux driver with its force option.
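For a one-off test, the option can be passed directly when loading the module (modprobe nct6683 force=1). To make it persistent, a minimal sketch follows. It assumes a distribution that reads module options from /etc/modprobe.d, which is not the case on every system; the helper name and the file name nct6683.conf are arbitrary choices for illustration:

```shell
# write_nct6683_force_conf [DIR]: persist the force option for the nct6683 driver.
# DIR defaults to /etc/modprobe.d (writing there requires root).
write_nct6683_force_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'options nct6683 force=1\n' > "$conf_dir/nct6683.conf"
}

# Example (as root):
# write_nct6683_force_conf && modprobe nct6683
```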

Once the module is loaded, lm-sensors lists accurate PWM fan speeds, but the temperature values are mislabeled and don’t quite match the temperatures I see in the UEFI H/W Monitor:

nct6683-isa-0a20
Adapter: ISA adapter
fan1:              471 RPM  (min =    0 RPM)
fan2:                0 RPM  (min =    0 RPM)
fan3:                0 RPM  (min =    0 RPM)
fan4:                0 RPM  (min =    0 RPM)
fan5:                0 RPM  (min =    0 RPM)
fan6:                0 RPM  (min =    0 RPM)
fan7:                0 RPM  (min =    0 RPM)
Thermistor 14:     +45.5 C  (low  =  +0.0 C)
                            (high =  +0.0 C, hyst =  +0.0 C)
                            (crit =  +0.0 C)  sensor = thermistor
AMD TSI Addr 98h:  +40.0 C  (low  =  +0.0 C)
                            (high =  +0.0 C, hyst =  +0.0 C)
                            (crit =  +0.0 C)  sensor = AMD AMDSI
intrusion0:       OK
beep_enable:      disabled

At least with the nct6683 Linux driver, there is no way to change the PWM fan speed: the corresponding files in the hwmon interface are marked read-only.

At this point I accepted that I won’t be able to work with the fan controller from Linux, and tried just configuring static fan control settings in the UEFI setup.

But despite identical fan settings, one of my 140mm fans would end up turned off. I’m not sure why — is it an unclean PWM signal, or is there just a bug in the fan controller?

Controlling the fans to reliably spin at a low speed is vital to reach goal 2 (low noise), so I looked around for third-party fan controllers and found the Corsair Commander Pro, which a blog post explains is compatible with Linux.

Server Disk

This part of the build is not router-related, but I figured if I have a fast machine with a fast network connection, I could add a fast big disk to it and retire my other server PC.

Specifically, I chose the Samsung 970 EVO Plus M.2 SSD with 2 TB of capacity. This disk can deliver 3500 MB/s of sequential read throughput, which is more than the ≈3000 MB/s that a 25 Gbit/s link can handle.
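The ≈3000 MB/s figure is a back-of-the-envelope conversion from bits to bytes. A small sketch of that arithmetic, using decimal units and ignoring Ethernet/IP/TCP overhead (which lowers the usable rate somewhat); the helper name is hypothetical:

```shell
# gbit_to_mbyte GBIT: raw line rate in MB/s (decimal units, no protocol overhead)
gbit_to_mbyte() {
    awk -v gbit="$1" 'BEGIN { printf "%d\n", gbit * 1000 / 8 }'
}

gbit_to_mbyte 25   # -> 3125
gbit_to_mbyte 10   # -> 1250
```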

Graphics Card

An important part of computer builds for me is making troubleshooting and maintenance as easy as possible. In my current tech landscape, that translates to connecting an HDMI monitor and a USB keyboard, for example to boot from a different device, to enter the UEFI setup, or to look at Linux console messages.

Unfortunately, the Ryzen 5 5600X does not have integrated graphics, so to get any graphics output, we need to install a graphics card. I chose the Zotac GeForce GT 710 Zone Edition, because it was the cheapest available card (60 CHF) that’s passively cooled.

An alternative to using a graphics card might be to use a PCIe IPMI card like the ASRock PAUL, however these seem to be harder to find, and more expensive.

Longer-term, I think the best option would be to use the Ryzen 5 5600G with integrated graphics, but that model only becomes available later this year.

Component List

I’m listing 2 different options here. Option A is what I built (router+server), but Option B is a lot cheaper if you only want a router. Both options use the same base components:

Price     Type           Article
347 CHF   Network card   FS.COM Intel XXV710, 2 × 25 Gbit/s (#75603)
329 CHF   Network card   FS.COM Intel XL710, 4 × 10 Gbit/s (#75602)
314 CHF   CPU            Ryzen 5 5600X
290 CHF   Mainboard      ASRock B550 Taichi
92 CHF    Case           Corsair 4000D Airflow (Midi Tower)
67 CHF    Fan control    Corsair Commander Pro
65 CHF    Case fan       2 × Noctua NF-A14 PWM (140mm)
62 CHF    CPU fan        Noctua NH-L12S
35 CHF    Case fan       1 × Noctua NF-A12x25 PWM (120mm)
60 CHF    GPU            Zotac GeForce GT 710 Zone Edition (1GB)

Base total: 1590 CHF

Option A: Server extension. Because I had some parts lying around, and because I wanted to use my router for serving files (from large RAM cache/fast disk), I went with the following parts:

Price     Type           Article
309 CHF   Disk           Samsung 970 EVO Plus 2000GB, M.2 2280
439 CHF   RAM            64GB HyperX Predator RAM (4x, 16GB, DDR4-3600, DIMM 288)
127 CHF   Power supply   Corsair SF600 Platinum (600W)
14 CHF    Power ext      Silverstone ATX 24-24Pin Extension (30cm)
10 CHF    Power ext      Silverstone ATX Extension 8-8(4+4)Pin (30cm)

The Corsair SF600 power supply is not server-related, I just had it lying around. I’d recommend going for the Corsair RM650x *2018* (which has longer cables) instead.

Server total: 2770 CHF

Option B: Non-server (router only) alternative. If you’re only interested in routing, you can opt for cheaper low-end disk and RAM, for example:

Price     Type           Article
112 CHF   Power supply   Corsair RM650x *2018*
33 CHF    Disk           Kingston A400 120GB M.2 SSD
29 CHF    RAM            Crucial CT4G4DFS8266 4GB DDR4-2666 RAM

Non-server total: 1764 CHF

ASRock B550 Taichi Mainboard UEFI Setup

To enable PCIe Bifurcation for our two PCIe 3.0 x8 card setup:

  1. Set Advanced > AMD PBS > PCIe/GFX Lanes Configuration
    to x8x8.
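After changing this setting, it is worth checking from Linux that each card really negotiated all of its lanes. A minimal sketch, using the current_link_width and max_link_width attributes that the kernel exposes for PCI devices in sysfs; the function name and the PCI address in the example are placeholders:

```shell
# pcie_link_ok DEVICE_DIR: print the negotiated vs. maximum link width and
# succeed only if the device runs at its maximum width.
pcie_link_ok() {
    dev="$1"
    cur=$(cat "$dev/current_link_width")
    max=$(cat "$dev/max_link_width")
    echo "$(basename "$dev"): x$cur of x$max"
    [ "$cur" -eq "$max" ]
}

# Example (find the address of your network card with lspci):
# pcie_link_ok /sys/bus/pci/devices/0000:01:00.0
```

An x8 card sitting in an electrically x4 slot would report x4 of x8 here.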

To always turn on the PC after power is lost:

  1. Set Advanced > Onboard Devices Configuration > Restore On AC Power Loss
    to Power On.

To PXE boot (via UEFI) on the onboard ethernet port (management), but disable slow option ROMs for PXE boot on the FS.COM network cards:

  1. Set Boot > Boot From Onboard LAN
    to Enabled.
  2. Set Boot > CSM (Compatibility Support Module) > Launch PXE OpROM Policy
    to UEFI only.

Fan Controller Setup

The Corsair Commander Pro fan controller is well-supported on Linux.

After enabling the Linux kernel option CONFIG_SENSORS_CORSAIR_CPRO, the device shows up in the hwmon subsystem.

You can completely spin up (100% PWM) or turn off (0% PWM) a fan like so:

# echo 255 > /sys/class/hwmon/hwmon3/pwm1
# echo 0 > /sys/class/hwmon/hwmon3/pwm1
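Note that the hwmon3 index depends on the order in which drivers are loaded, so it may differ on your machine or between boots. A sketch of looking the device up by its name file instead; the function name is hypothetical, and the example assumes the driver registers under the name corsaircpro (check with cat /sys/class/hwmon/hwmon*/name):

```shell
# find_hwmon NAME [ROOT]: print the hwmon directory whose name file equals NAME.
find_hwmon() {
    name="$1"
    root="${2:-/sys/class/hwmon}"
    for dir in "$root"/hwmon*; do
        if [ -r "$dir/name" ] && [ "$(cat "$dir/name")" = "$name" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Example (as root), assuming the name "corsaircpro":
# echo 255 > "$(find_hwmon corsaircpro)/pwm1"
```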

I run my fans at 13% PWM, which translates to about 226 rpm:

# echo 33 > /sys/class/hwmon/hwmon3/pwm1
# cat /sys/class/hwmon/hwmon3/fan1_input
226
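The hwmon pwm files take values from 0 to 255 rather than percentages, which is where the 33 comes from. A small sketch of the conversion (the helper name is made up, and rounding to the nearest integer is my choice, not something the driver requires):

```shell
# pwm_from_percent PERCENT: map 0..100 % onto the 0..255 range of hwmon pwm files
pwm_from_percent() {
    # adding 0.5 before the integer conversion rounds to the nearest value
    awk -v pct="$1" 'BEGIN { printf "%d\n", pct * 255 / 100 + 0.5 }'
}

pwm_from_percent 13    # -> 33
pwm_from_percent 100   # -> 255
```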

Conveniently, the Corsair Commander Pro stores your settings even when power is lost. So you don’t even need to run a permanent fan control process; a one-off adjustment might be sufficient.

Power Usage

The PC consumes about 48W of power when idle (only management network connected) by default without further tuning. Each extra network link increases power usage by ≈1W:

graph showing power consumption when enabling network links

Enabling all Ryzen-related options in my Linux kernel and switching to the powersave CPU frequency governor lowers power usage by ≈1W.
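Switching the governor can be done at runtime through the cpufreq sysfs interface. A minimal sketch, assuming a kernel with cpufreq support where each CPU exposes a scaling_governor file; the function name is hypothetical, and the setting does not survive a reboot:

```shell
# set_governor GOVERNOR [CPU_ROOT]: write GOVERNOR to every CPU's scaling_governor file.
set_governor() {
    gov="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # skip the unexpanded glob pattern if no CPU exposes cpufreq
        [ -e "$f" ] || continue
        echo "$gov" > "$f"
    done
}

# Example (as root):
# set_governor powersave
```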

On some mainboards, you might need to force-enable Global C-States to save power. Not on the B550 Taichi, though.

I tried undervolting the CPU, but that didn’t even make ≈1W of difference in power usage. Potentially making my setup unreliable is not worth that little power saving to me.

I measured these values using a Homematic HM-ES-PMSw1-Pl-DN-R5 I had lying around.

Performance

Goal 1 is to saturate 25 Gbit/s, for example using two 10 Gbit/s downloads. I’m talking about large bulk transfers here, not many small transfers.

To get a feel for the performance/headroom of the router build, I ran 3 different tests.

Test A: 10 Gbit/s bridging throughput

For this test, I connected 2 PCs to the router’s XL710 network card and used iperf3(1) to generate a 10 Gbit/s TCP stream between the 2 PCs. The router doesn’t need to modify the packets in this scenario, only forward them, so this should be the lightest load scenario.

bridging throughput

Test B: 10 Gbit/s NAT throughput

In this test, the 2 PCs were connected such that the router performs Network Address Translation (NAT), which is required for downloads from the internet via IPv4.

This scenario is slightly more involved, as the router needs to modify packets. But, as we can see below, a 10 Gbit/s NAT stream consumes barely more resources than 10 Gbit/s bridging:

NAT throughput

Test C: 4 × 10 Gbit/s TCP streams

In this test, I wanted to max out the XL710 network card, so I connected 4 PCs and started an iperf3(1) benchmark between each PC and the router itself, simultaneously.

This scenario consumes about 16% CPU, meaning we’ll most likely have plenty of headroom even when all ports are maxed out!

four 10 Gbit/s streams

Tip: make sure to enable the CONFIG_IRQ_TIME_ACCOUNTING Linux kernel option to include IRQ handlers in CPU usage numbers for accurate measurements.
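Whether an option is enabled can be checked against the kernel’s config file. A small sketch with a hypothetical helper; the /boot/config-… path in the example is an assumption that holds on many distributions but not all (kernels built with CONFIG_IKCONFIG_PROC expose the same information as /proc/config.gz instead):

```shell
# kernel_option_enabled OPTION CONFIG_FILE: succeed if OPTION is built in (=y)
# or built as a module (=m) according to CONFIG_FILE.
kernel_option_enabled() {
    grep -Eq "^$1=(y|m)\$" "$2"
}

# Example:
# kernel_option_enabled CONFIG_IRQ_TIME_ACCOUNTING "/boot/config-$(uname -r)" && echo enabled
```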

Alternatives considered

The passively-cooled SuperServer E302-9D comes with 2 SFP+ ports (10 Gbit/s). It even comes with 2 PCIe 3.0 x8 capable slots. Unfortunately it seems impossible to currently buy this machine, at least in Switzerland.

You can find a few more suggestions in the replies of this Twitter thread. Most are either unavailable, require a lot more DIY work (e.g. a custom case), or don’t support 25 Gbit/s.

Router software: router7 porting

I wrote router7, my own small home internet router software in Go, back in 2018, and have been using it ever since.

I don’t have time to support any users, so I don’t recommend anyone else use router7, unless the project really excites you, and the lack of support doesn’t bother you! Instead, you might be better served with a more established and supported router software option. Popular options include OPNsense or OpenWrt. See also Wikipedia’s List of router and firewall distributions.

To make router7 work for this 25 Gbit/s router PC build, I had to make a few adjustments.

Because we are using UEFI network boot instead of BIOS network boot, I first had to make the PXE boot implementation in router7’s installer work with UEFI PXE boot.

I then enabled a few additional kernel options for network and storage drivers in router7’s kernel.

To router7’s control plane code, I added bridge network device configuration, which in my previous 2-port router setup was not needed.

During development, I compiled a few Linux programs statically or copied them with their dependencies (→ gokrazy prototyping) to run them on router7, such as sensors(1), ethtool(8), as well as iproute2’s ip(8) and bridge(8) implementation.

Next Steps

Based on my tests, the hardware I selected seems to deliver enough performance to use it for distributing a 25 Gbit/s upstream link across multiple 10 Gbit/s devices.

I won’t know for sure until the fiber7 Point Of Presence (POP, German Anschlusszentrale) close to my home is upgraded to support 25 Gbit/s “Fiber7-X2” connections. As I mentioned, unfortunately the upgrade plan is delayed due to the component shortage. I’ll keep you posted!

Other Builds

In case my build doesn’t exactly match your requirements, perhaps these others help inspire you:

Appendix A: DPDK test

Pim ran a DPDK based loadtester called T-Rex on this machine. Here’s his summary of the test:

For DPDK, this hardware does 4x10G at 64b frames. It does not do 6x10G as it tops out at 62Mpps using 4 cores (of 15.5Mpps per core).

I couldn’t test 25G symmetric [because we lacked a 25G DAC cable], but extrapolating from the numbers, 3 CPUs source and sink ~24.6Gbit per core, so we’d probably make it, leaving 1 core for OS and 2 cores for controlplane.

If the machine had a 12 core Ryzen, it would saturate all NICs with room to spare. So that’s what I’ll end up buying :)
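These numbers line up with the usual line-rate arithmetic for minimum-size Ethernet frames. A sketch, assuming “64b frames” means 64-byte frames and counting the 20 bytes each frame additionally occupies on the wire (7 bytes preamble, 1 byte start-of-frame delimiter, 12 bytes inter-frame gap); the helper name is made up:

```shell
# line_rate_mpps GBIT FRAME_BYTES: maximum frames per second, in millions,
# that fit on a link of GBIT Gbit/s at the given frame size.
line_rate_mpps() {
    awk -v gbit="$1" -v frame="$2" 'BEGIN {
        printf "%.2f\n", gbit * 1000 / ((frame + 20) * 8)
    }'
}

line_rate_mpps 10 64   # one 10G port -> 14.88
line_rate_mpps 40 64   # 4x10G        -> 59.52
line_rate_mpps 60 64   # 6x10G        -> 89.29
```

The reported ceiling of 62 Mpps sits between the 4x10G and 6x10G requirements, which matches the summary.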

DPDK test

at 2021-07-10 11:43

2021-06-21

michael-herbst.com

Errors and uncertainty quantification in density-functional theory

On 8th June I was invited to the seminar of the Uncertainty Quantification (UQ) group of Prof. Youssef Marzouk at MIT. Youssef and I had planned to have this seminar since my involvement with MIT's CESMIX project last February (see also this blog article), but it took us quite some time to get it arranged. Finally I managed to present my point of view on UQ in density-functional theory (DFT), sneakily re-using most of the slides I had already prepared for my recent UQ-in-DFT talk at RWTH Aachen's UQ group the week earlier.

Similar to the Aachen talk I put strong emphasis on engaging audience participation and discussion. I first introduced the UQ group to electronic structure theory and DFT, allowing for enough time to discuss the key ideas of the physics. Then I pointed out current research in error estimation and UQ in DFT and provided a number of opportunities for interesting future UQ-related research. The discussion was very lively and I hardly made it beyond a slide without a question, which was just great. Since a lot could be gained from stronger uncertainty quantification tools in DFT in my opinion, I hope this talk made DFT more accessible to the UQ group and made some people curious to look into the details. On my end I would definitely enjoy learning more about UQ in the future and look forward to my future UQ-related involvements in the CESMIX project. As usual my slides are attached below.

Link
Errors and uncertainty quantification in electronic-structure theory (Slides)

by Michael F. Herbst at 2021-06-21 10:00 under Research, talk, electronic structure theory, Kohn-Sham, uncertainty quantification, DFT, solid state

2021-06-15

michael-herbst.com

Talk at MATH4UQ seminar series at RWTH

On 1st June I was invited to the MATH4UQ seminar series of the Mathematics of Uncertainty Quantification chair of Prof. Raul Tempone at RWTH Aachen University.

Over the past months I got more and more interested in mathematical methods for uncertainty quantification (UQ) as an opportunity to estimate and understand errors in density-functional theory (DFT) calculations. In particular I imagine UQ methods to be useful to estimate the model error of a DFT model itself. At this level statistical approaches are likely the only feasible option for a practical error estimation, since the mathematical complexity of modern DFT models beyond the local density approximations very likely makes a posteriori error analysis strategies infeasible.

In my talk I explained the basics of DFT and provided a rough overview of present UQ developments in this method. Since I know very little about UQ and my audience knew very little about DFT, I intended the talk to be more of a Q&A session, where the slides are around to stimulate discussion. This turned out to work very well and I am very grateful for the many interesting questions from the audience and the enjoyable discussion. As usual my slides are attached below. Additionally a recording of my talk can be found on YouTube.

Link
Errors in electronic-structure theory: Status and directions for future research (Slides)
Youtube recording of the talk

by Michael F. Herbst at 2021-06-15 10:00 under Research, talk, electronic structure theory, Kohn-Sham, uncertainty quantification, DFT, solid state

2021-06-05

sECuREs website

Laptop review: ThinkPad X1 Extreme (Gen 2)

ThinkPad X1 Extreme Gen 2, pear for scale

For many of my school and university years, I used and liked my ThinkPad X200 ultraportable laptop. But now that these years are long gone, I realized my use-case for laptops had changed: instead of carrying my laptop with me every day, I am now only bringing it on occasion, for example when I travel to conferences, visit friends, or do volunteer work.

After the ThinkPad X200, I used a few different laptops:

  • MacBook Pro 13" Retina, bought for its screen
  • ThinkPad X1 Carbon, which newly introduced a hi-dpi screen to ThinkPads
  • Dell XPS 9360, for a change, to try a device that ships with Linux

With each of these devices, I have felt limited by the lack of connectors and slim compute power that comes with the Ultrabook brand, even after years of technical progress.

More compute power is nice to be able to work on projects with larger data sets, for example debiman (scanning and converting all manpages in Debian), or distri (building Linux packages).

More peripheral options such as USB ports are nice when connecting a keyboard, trackball, USB-to-serial adapter, etc., to work on a micro controller or Raspberry Pi project, for example.

So, I was ready to switch from the heaviest Ultrabooks to the lightest of the “mobile workstation” category, when I stumbled upon Lenovo’s ThinkPad X1 Extreme (Gen 2), and it piqued my curiosity.

Peripherals

Let me start by going into the key peripherals of a laptop: keyboard, touchpad and screen. I will talk about these independently from the remaining hardware because they define the experience of using the computer.

Keyboard

After having used the Dell XPS 9360 for a few years, I can confidently say that the keyboard of the ThinkPads is definitely much better, and in a noticeable way.

It’s not that the Dell keyboards are bad. But comparing the Dell and ThinkPad side-by-side makes it really clear that the ThinkPad keyboards are the best notebook keyboards.

On the ThinkPad keyboard, every key press lands exactly as I imagine. Never do I need to hit a key twice because I didn’t press it just-right, and never do I notice additional ghost key presses.

Even though I connect my external Kinesis Advantage keyboard when doing longer stretches of work, the quality of the built-in keyboard matters: a good keyboard enables using the laptop on the couch.

Touchpad

Unfortunately, while the keyboard is great, I can’t say the same about the touchpad. I mean, it’s not terrible, but it’s also not good by any stretch.

This seems to have been the status quo with PC touchpads for decades. It really blows my mind that Apple’s touchpads are consistently so much better!

My only hope is that Bill Harding (GitClear), who is working on improving the Linux touchpad experience, will eventually find a magic software tweak or something…

As mentioned on the ArchWiki, I also had to adjust the sensitivity like so:

% xinput set-prop 'SynPS/2 Synaptics TouchPad' 'libinput Accel Speed' 0.5

Display

I have high demands regarding displays: since 2013, every device of mine has a hi-dpi display.

The industry hasn’t improved displays across the board as fast as I’d like, so non-hi-dpi displays are still quite common. The silver lining is that it makes laptop selection a little easier for me: anything without a decent display I can discard right away.

I’m glad to report that the 4K display in the ThinkPad X1 Extreme with its 3840x2160 pixels is sharp, bright, and generally has good viewing angles.

It’s also a touchscreen, which I don’t strictly need, but it’s nice to use it from time to time.

I use the display in 200% scaling mode, i.e. I set Xft.dpi: 192. See also HiDPI in ArchWiki.

Hardware

Spec-wise, the ThinkPad X1 Extreme is a beast!

ThinkPad X1 Extreme Specs

The build quality seems very robust to me.

Another big plus of the ThinkPad series over other laptop series is the availability of the official Hardware Maintenance Manual: you can put “ThinkPad X1 Extreme Gen 2 Hardware Maintenance Manual” into Google and will find p1_gen2_x1extreme_hmm_v1.pdf as the first hit. This manual describes in detail how to repair or upgrade your device if you want to (or have to) do it yourself.

WiFi

The built-in Intel AX200 WiFi interface works fine, provided you have a new-enough linux-firmware package and kernel version installed.

I had trouble with Linux 5.6.0, and Linux 5.6.5 fixed it. Luckily, at the time of writing, Linux 5.11 is the most recent release, so most distributions should be recent enough for things to just work.

The WiFi card reaches almost the same download speed as the most modern WiFi device I can test: a MacBook Air M1. Both are connected to my UniFi UAP-AC-HD access point.

Laptop               Download    Upload
ThinkPad X1 Extreme  500 Mbit/s  150 Mbit/s
MacBook Air M1       600 Mbit/s  500 Mbit/s

I’m not sure why the upload speed is so low in comparison.

GPU

The GPU in this machine is by far the most troublesome bit of hardware.

I had hoped that after many years of laptops containing Intel/nVidia hybrid graphics, this setup would largely work, but I was disappointed.

Neither the proprietary nVidia driver nor the nouveau driver worked reliably for me. I ran into kernel error messages and hard-freezes, with even SSH sessions to the machine breaking.

In the end, I blacklisted the nouveau driver to use Intel graphics only:

% echo blacklist nouveau | sudo tee /etc/modprobe.d/blacklist.conf 

Without the nVidia driver, the GPU will not go into powersave mode, so I remove it from the PCI bus entirely to save power:

#!/bin/zsh

sudo tee /sys/bus/pci/devices/0000\:01\:00.0/remove <<<1
sudo tee /sys/bus/pci/devices/0000\:01\:00.1/remove <<<1

You can only re-awaken the GPU with a reboot.

Obviously this isn’t a great setup — I would prefer to be able to actually use the GPU. If you have any tips or a better experience, please let me know.

Also note that the HDMI port will be unusable if you go this route, as the HDMI port is connected to the nVidia GPU only.

Battery life

The 80 Wh battery lasts between 5 and 6 hours for me, without any extra power saving tuning beyond what the Linux distribution Fedora 33 comes with by default.

This is good enough for using the laptop away from a power socket from time to time, which matches my expectation for this kind of mobile workstation.

Software support

Linux support is generally good on this machine! Yes, I provide a few pointers in this article regarding problems, patches and old software versions. But, if you use a newer Linux distribution, all of these fixes are included and things just work out of the box. I tested with Fedora 33.

For a few months, I was using this laptop exclusively with my research Linux distribution distri, so even if you just track upstream software closely, the machine works well.

Firmware updates

Lenovo partnered with the Linux Vendor Firmware Service Project (LVFS), which means that through fwupd, ThinkPad laptops such as this X1 Extreme can easily receive firmware updates!

This is a huge improvement in comparison to earlier ThinkPad models, where you had to jump through hoops with Windows-only software, or CD images that you needed to boot just right.

If your laptop has a very old firmware version (before 1.30), you might be affected by the skipping-keystrokes issue. You can check your firmware version using the always-handy lshw(1) tool.

Performance

The specific configuration of my ThinkPad is:

ThinkPad X1 Extreme Spec (2020)
CPU   Intel Core i7-9750H CPU @ 2.60GHz
RAM   2 × 32 GB Samsung M471A4G43MB1-CTD
Disk  2 × SAMSUNG MZVLB2T0HALB-000L7 NVMe disk

You can google for CPU benchmarks and comparisons yourself, and those likely are more scientific and carefully done than I have time for.

What I can provide however, is a comparison of working on one of my projects on the ThinkPad vs. on my workstation, an Intel Core i9-9900K that I bought in 2018:

Workstation Spec (2018)
CPU   Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
RAM   4 × Corsair CMK32GX4M2A2666C16
Disk  Corsair Force MP600 M.2 NVMe disk

Specifically, I am comparing how long my manpage static archive generator debiman takes to analyze and render all manpages in Debian unstable, using the following command:

ulimit -n 8192; time ~/go/bin/debiman \
  -keyring=/usr/share/keyrings/debian-archive-keyring.gpg \
  -sync_codenames=, \
  -sync_suites=unstable \
  -serving_dir=/srv/man/benchmark \
  -inject_assets=~/go/src/github.com/Debian/debiman/debian-assets \
  -concurrency_render=20 \
  -alternatives_dir=~/go/src/github.com/Debian/debiman/piuparts

On both machines, I ensured that:

  1. The CPU frequency scaling governor was set to performance
  2. A warm apt-cacher-ng cache was present, i.e. network download was not part of the test.
  3. Linux kernel caches were dropped using echo 3 | sudo tee /proc/sys/vm/drop_caches
  4. I was using debiman git revision f78c160

Here are the results:

Machine                      Time
i9-9900K Workstation         4:57,10 (100%)
ThinkPad X1 Extreme (Gen 2)  7:19,56 (147%)

This reaffirms my impression that even high-end laptop hardware just cannot beat a workstation setup (which has more space and better thermals), but it comes close enough to be useful.
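The percentage in the results table can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name parse_elapsed is made up for this illustration, and it assumes the times are formatted as minutes:seconds with a decimal comma, as in the table.

```python
def parse_elapsed(text: str) -> float:
    """Convert a 'minutes:seconds,hundredths' string such as '4:57,10' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds.replace(",", "."))


# Wall-clock times copied from the results table.
workstation = parse_elapsed("4:57,10")
thinkpad = parse_elapsed("7:19,56")

# How much longer the laptop takes, relative to the workstation.
ratio = thinkpad / workstation

print(f"workstation: {workstation:.2f} s")
print(f"thinkpad:    {thinkpad:.2f} s")
print(f"slowdown:    {ratio:.3f}x")
```

The ratio comes out just under 1.48, which lines up with the 147% in the table when the percentage is truncated rather than rounded.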

Conclusion

Positives:

  • The ergonomics of the device really are great. It is a pleasure to type on a first-class, full-size ThinkPad keyboard. The screen has good quality and a high resolution.

  • Performance-wise, this machine can almost replace a proper workstation.

Negatives:

  • the mediocre battery life
  • an annoyingly loud fan that spins up too frequently
  • poor software/driver support for hybrid nVidia GPUs.

Notably, all of these could be improved by better power saving, so perhaps it’s just a matter of time until Linux kernel developers land some improvements…? :)

at 2021-06-05 18:43

2021-05-30

michael-herbst.com

SIAM LA: Robust and efficient accelerated methods for density-functional theory

Just one day after my talk at the SIAM Materials Science conference (blog article) I gave another talk at a SIAM meeting, this time at SIAM Linear Algebra. I was very much looking forward to participating in SIAM LA, firstly because it was the first time I attended this conference, and secondly because it was a good opportunity to present our recent algorithmic work on robust DFT methods to an international crowd of mathematicians.

I presented as part of the minisymposium Theory and Practice of Extrapolation and Acceleration Methods, which consisted of three interesting sessions of historic and recent talks about extrapolation and convergence acceleration in the broadest sense of the word. Iterative methods as well as summation theory and sequence summation were discussed, which turned out to be a very enjoyable mix. I am really grateful to the mini organisers, Agnieszka Miedlar and Yousef Saad, for the invitation and for allowing me to be part of these great sessions.

Beyond the mini I enjoyed a number of talks about emerging topics in numerical linear algebra such as mixed-precision computation, low-rank tensor approximations or randomised methods. Even though the time zone difference meant that the conference was mostly running during the afternoon and late evening for me and even though the collision with SIAM Materials Science made it quite a busy week, I took a lot from SIAM LA and I'm already looking forward to next time.

Link
Robust and Efficient Accelerated Methods for Kohn-Sham Density-Functional Theory

by Michael F. Herbst at 2021-05-30 16:01 under Research, talk, electronic structure theory, Julia, DFTK, numerical analysis, Kohn-Sham, high-throughput, DFT, solid state

2021-05-30

michael-herbst.com

SIAM MS: Using the density-functional toolkit to design black-box DFT methods

After being postponed by one year due to the pandemic, the SIAM materials science conference finally took place in virtual form over the last two weeks (from 17th to 28th May). Unfortunately this meant that the conference was scheduled in parallel to the SIAM Linear Algebra virtual conference, where I also presented (blog article), which made my past two weeks rather busy.

At SIAM Materials I was invited to talk in the Minisymposium Numerics of electronic structure calculations, which was organised by the steering committee of the GAMM activity group moansi (Modelling, Analysis and Simulation of Molecular Systems). Besides interesting sessions about some mathematical insights into electronic-structure methods, this gave the mini the additional feature of a spring gathering for the usual crowd of the activity group, whom I already had the pleasure of meeting at previous moansi workshops.

In my talk I gave a broad overview of the recent projects we realised with the density-functional toolkit for making self-consistent field calculations for density-functional theory more robust and reliable. Apart from our work on preconditioning inhomogeneous systems with the LDOS preconditioner I also presented first work-in-progress results on an adaptive damping strategy we recently came up with. The idea of our method is to use a line search based on an approximate quadratic model for cases where a proposed SCF step is not successful (i.e. increases energy and SCF residual). Firstly, this allows the damping parameter to be chosen automatically (instead of requiring the user to choose one by trial and error). Secondly, it makes the SCF procedure more robust, especially for tricky cases. For example, in our experiments on Heusler alloys our adaptive damping approach was the only method that managed to converge in some cases.
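To give a feel for what a line search on a quadratic model can look like, here is a toy Python sketch. It is not the algorithm from the talk or from DFTK: the function name, the clamping bounds and the one-dimensional toy energy are all assumptions made for illustration. The model is E(a) ≈ e0 + slope0·a + ½·curvature·a², where the curvature is fitted from a single trial step.

```python
def quadratic_damping(e0, slope0, alpha_trial, e_trial,
                      alpha_min=0.05, alpha_max=1.0):
    """Pick a damping value by minimising a fitted quadratic model.

    e0          energy at the current point (damping 0)
    slope0      derivative of the energy along the step at damping 0
    alpha_trial damping of the step that was just tried
    e_trial     energy obtained with that trial damping
    """
    # Fit the curvature so the model reproduces e_trial at alpha_trial.
    curvature = 2.0 * (e_trial - e0 - slope0 * alpha_trial) / alpha_trial**2
    if curvature <= 0.0:
        # The model has no minimum along this direction: take the largest
        # step if it goes downhill, otherwise the smallest allowed one.
        return alpha_max if slope0 < 0.0 else alpha_min
    alpha = -slope0 / curvature
    # Keep the result inside the allowed damping range.
    return min(max(alpha, alpha_min), alpha_max)


def toy_energy(alpha):
    """Hypothetical energy along the step, with its minimum at alpha = 0.3."""
    return (alpha - 0.3) ** 2


# A full step (alpha = 1.0) overshoots and raises the energy, so the
# quadratic model is used to propose a smaller damping instead.
damping = quadratic_damping(e0=toy_energy(0.0), slope0=-0.6,
                            alpha_trial=1.0, e_trial=toy_energy(1.0))
print(f"proposed damping: {damping:.3f}")
```

Because the toy energy is exactly quadratic, the model recovers its minimum; in a real SCF the model is only approximate and the step would have to be re-checked.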

Link
Using the density-functional toolkit (DFTK) to design black-box methods in density-functional theory

by Michael F. Herbst at 2021-05-30 16:00 under Research, talk, electronic structure theory, Julia, DFTK, theoretical chemistry, numerical analysis, Kohn-Sham, high-throughput, DFT, solid state

2021-05-28

sECuREs website

How I configured and then promptly returned a MikroTik CCR2004 router for Fiber7

init7 recently announced that with their FTTH fiber offering Fiber7, they will now sell and connect you with 25 Gbit/s (Fiber7-X2) or 10 Gbit/s (Fiber7-X) fiber optics, if you want more than 1 Gbit/s.

This is possible thanks to the upgrade of their network infrastructure as part of their “lifecycle management”, meaning the old networking gear was declared as end-of-life. The new networking gear supports not only SFP+ modules (10 Gbit/s), but also SFP28 modules (25 Gbit/s).

Availability depends on the POP (Point Of Presence, German «Anschlusszentrale») you’re connected to. My POP is planned to be upgraded in September.

Nevertheless, I wanted to already prepare my end of the connection, and ordered the only router that init7 currently lists as compatible with Fiber7-X/X2: the MikroTik CCR2004-1G-12S+2XS.

MikroTik CCR2004-1G-12S+2XS

The rest of this article walks through what I needed to configure (a lot, compared to Ubiquiti or OpenWRT) in the hope that it helps other MikroTik users, and then ends in Why I returned it.

Configuration

Connect an Ethernet cable to the management port on the MikroTik and:

  1. log into the system using ssh admin@192.168.88.1
  2. point a web browser to “Webfig” at http://192.168.88.1/ (no login required)

Update firmware

Update the CCR2004 to the latest firmware version. At the time of writing, the Long-term RouterOS track is at version 6.47.9 for the CCR2004 (ARM64):

  1. Use /system package print to display the current version.
  2. Upload routeros-arm64-6.47.9.npk using Webfig.
  3. /system reboot and verify that /system package print shows 6.47.9 now.

Set up auth

Set a password to prevent others from logging into the router:

/user set admin password=secret

Additionally, you can enable passwordless SSH key login, if you want.

  1. Create an RSA key, because ed25519 keys are not supported:

    % ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key: /home/michael/.ssh/id_mikrotik
    
  2. Upload the id_mikrotik.pub file in Webfig

  3. Import the SSH public key for the admin user:

    /user ssh-keys import user=admin public-key-file=id_mikrotik.pub
    

Lock down the router

  1. Enable HTTPS in Webfig.

  2. Disable all remote access except for SSH and HTTPS:

    /ip service disable telnet,ftp,www,api,api-ssl,winbox
    
  3. Follow MikroTik Securing Your Router recommendations:

    /tool mac-server set allowed-interface-list=none
    /tool mac-server mac-winbox set allowed-interface-list=none
    /tool mac-server ping set enabled=no
    /tool bandwidth-server set enabled=no
    /ip ssh set strong-crypto=yes
    /ip neighbor discovery-settings set discover-interface-list=none
    

Enable DHCPv6 Client

For some reason, you need to explicitly enable IPv6 in 2021:

/system package enable ipv6
/system reboot

MikroTik says this is a precaution so that users don’t end up with default-open firewall settings for IPv6. But then why don’t they just add some default firewall rules?!

Anyway, to configure and immediately enable the DHCPv6 client, use:

/ipv6 dhcp-client add pool-name=fiber7 pool-prefix-length=64 interface=sfp28-1 add-default-route=yes use-peer-dns=no request=address,prefix

Modify the IPv6 DUID

Unfortunately, MikroTik does not offer any user interface to set the IPv6 DUID, which I need to configure to obtain my static IPv6 network prefix from my provider’s DHCPv6 server.

Luckily, the DUID is included in backup files, so we can edit it and restore from backup:

  1. Run /system backup save

  2. Download the backup file in Webfig by navigating to Files → Backup → Download.

  3. Convert the backup file to hex in textual form, edit the DUID and convert it back to binary:

    % xxd MikroTik-19700102-0111.backup MikroTik-19700102-0111.backup.hex
    
    % emacs MikroTik-19700102-0111.backup.hex
    # Search for “dhcp/duid” in the file and edit accordingly:
    # got:  00030001085531dfa69e
    
    % xxd -r MikroTik-19700102-0111.backup.hex MikroTik-19700102-0111-patched.backup
    
  4. Upload the file in Webfig, then restore the backup:

    /system backup load name=MikroTik-19700102-0111-patched.backup
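The hex-edit step can also be scripted. The following Python sketch is an assumption-laden alternative to the xxd round trip: it assumes the DUID is stored in the backup as raw bytes (the hex column of the xxd dump, which is the only column xxd -r reads back) and that the new DUID has the same length as the old one. The function name and the replacement DUID are placeholders, and the backup format itself is not documented.

```python
def patch_duid(backup: bytes, old_duid_hex: str, new_duid_hex: str) -> bytes:
    """Return a copy of the backup with the old DUID bytes replaced in place."""
    old = bytes.fromhex(old_duid_hex)
    new = bytes.fromhex(new_duid_hex)
    if len(old) != len(new):
        # A different length would shift everything after the DUID.
        raise ValueError("old and new DUID must have the same length")
    occurrences = backup.count(old)
    if occurrences != 1:
        # Refuse to guess when the byte sequence is missing or ambiguous.
        raise ValueError(f"expected exactly one match, found {occurrences}")
    return backup.replace(old, new)


# Demonstration on synthetic data, not on a real backup file.
OLD_DUID = "00030001085531dfa69e"  # value shown in the hex dump above
NEW_DUID = "00030001001122334455"  # placeholder, use your provider's DUID
fake_backup = b"header" + bytes.fromhex(OLD_DUID) + b"trailer"
patched = patch_duid(fake_backup, OLD_DUID, NEW_DUID)
print(patched.hex())
```

For a real file you would read the backup with open(path, "rb"), write the patched bytes to a new file, and keep the original around in case the restore fails.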

Enable IPv6 Router Advertisements

To make the router assign an IPv6 address from the obtained pool for itself, and then send IPv6 Router Advertisements to the network, set:

/ipv6 address add address=::1 from-pool=fiber7 interface=bridge1
/ipv6 nd add interface=bridge1 managed-address-configuration=yes other-configuration=yes
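To illustrate what address=::1 with from-pool does, here is a small Python sketch that combines a delegated prefix with the ::1 host part. The delegated prefix is a documentation example, not my real prefix, and the sketch assumes the router takes the first /64 from the pool, which I have not verified against RouterOS.

```python
import ipaddress


def address_from_pool(delegated_prefix: str, pool_prefix_length: int,
                      host_part: str) -> ipaddress.IPv6Interface:
    """Combine the first subnet of a delegated prefix with a host part."""
    delegated = ipaddress.IPv6Network(delegated_prefix)
    # Assumption: the first subnet of the requested length is handed out.
    subnet = next(delegated.subnets(new_prefix=pool_prefix_length))
    host = int(ipaddress.IPv6Address(host_part))
    address = ipaddress.IPv6Address(int(subnet.network_address) | host)
    return ipaddress.IPv6Interface(f"{address}/{pool_prefix_length}")


# 2001:db8::/32 is reserved for documentation; the /48 is hypothetical.
router_address = address_from_pool("2001:db8:1234::/48", 64, "::1")
print(router_address)
```

Clients on bridge1 would then see Router Advertisements for the /64 this address belongs to.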

Enable DHCPv4 Client

To configure and immediately enable the DHCPv4 client on the upstream port, use:

/ip dhcp-client add interface=sfp28-1 disabled=no

I also changed the MAC address to match my old router’s address, just to take maximum precaution to avoid any Port Security related issues with my provider’s DHCP server:

/interface ethernet set sfp28-1 mac-address=00:0d:fa:4c:0c:31

Enable DNS Server

By default, only the MikroTik itself can send DNS queries. Enable access for network clients:

/ip dns set allow-remote-requests=yes

Enable DHCPv4 Server

First, let’s bundle all SFP+ ports into a single bridge interface:

/interface bridge add name=bridge1
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus1 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus2 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus3 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus4 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus5 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus6 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus7 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus8 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus9 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus10 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus11 hw=yes
/interface bridge port add bridge=bridge1 interface=sfp-sfpplus12 hw=yes

This means we’ll use the device like a big switch with routing between the switch and the uplink port sfp28-1.
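Typing twelve near-identical lines invites typos, so the bridge port commands can also be generated. A minimal Python sketch follows; the function name is made up, and the port naming scheme is taken from the commands above.

```python
from typing import List


def bridge_port_commands(bridge: str = "bridge1", ports: int = 12) -> List[str]:
    """Build one RouterOS 'bridge port add' command per SFP+ port."""
    return [
        f"/interface bridge port add bridge={bridge} "
        f"interface=sfp-sfpplus{number} hw=yes"
        for number in range(1, ports + 1)
    ]


# Print the commands so they can be pasted into the SSH session.
for command in bridge_port_commands():
    print(command)
```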

To configure the DHCPv4 Server, configure an IP address, then start the setup wizard:

/ip address add address=10.0.0.1/24 interface=bridge1
/ip dhcp-server setup
Select interface to run DHCP server on

dhcp server interface: bridge1
Select network for DHCP addresses

dhcp address space: 10.0.0.0/24
Select gateway for given network

gateway for dhcp network: 10.0.0.1
Select pool of ip addresses given out by DHCP server

addresses to give out: 10.0.0.2-10.0.0.240
Select DNS servers

dns servers: 10.0.0.1
Select lease time

lease time: 20m
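As a quick sanity check of the wizard answers, the following Python sketch verifies that the address range lies inside the network, that the gateway is not part of the range, and counts the available leases. The function name is an assumption for this example.

```python
import ipaddress


def pool_size(first: str, last: str, network: str, gateway: str) -> int:
    """Validate a DHCP range against its network and return its size."""
    net = ipaddress.ip_network(network)
    low = ipaddress.ip_address(first)
    high = ipaddress.ip_address(last)
    gw = ipaddress.ip_address(gateway)
    if low > high:
        raise ValueError("range start is after range end")
    if low not in net or high not in net:
        raise ValueError("range is not fully inside the network")
    if low <= gw <= high:
        raise ValueError("gateway address is inside the DHCP range")
    return int(high) - int(low) + 1


# Values as entered into the setup wizard above.
leases = pool_size("10.0.0.2", "10.0.0.240", "10.0.0.0/24", "10.0.0.1")
print(f"{leases} addresses available for DHCP clients")
```

The addresses above 10.0.0.240 stay outside the pool and remain free for manual assignment.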

Enable IPv4 NAT

We need NAT to route all IPv4 traffic over our single public IP address:

/ip firewall nat add action=masquerade chain=srcnat out-interface=sfp28-1 to-addresses=0.0.0.0

Disable NAT services for security, e.g. to mitigate against NAT slipstreaming attacks:

/ip firewall service-port disable ftp,tftp,irc,h323,sip,pptp,udplite,dccp,sctp

I can observe ≈10-20% CPU load when doing a Gigabit speed test over IPv4.

TODO list

The following features I did not get around to configuring, but they were on my list:

Why I returned it

Initially, I thought the device’s fan spins up only at boot, and then the large heatsink takes care of all cooling needs. Unfortunately, after an hour or so into my experiment, I noticed that the MikroTik would spin up the fan for a whole minute or so occasionally! Very annoying.

I also ran into weird DNS slow-downs, which I didn’t fully diagnose. In Wireshark, it looked like my machine sent 2 DNS queries but received only 1 DNS result, and then waited for a timeout.

I also noticed that I have a few more unexpected dependencies such as my home automation using DHCP lease state by subscribing to an MQTT topic. Addressing this issue and other similar little problems would have taken a bunch more time and would have resulted in a less reliable system than I have today.

Since I last used MikroTik in 2014 the software seems to have barely changed. I wish they finally implemented some table-stakes features like DNS resolution for DHCP hostnames.

Given all the above, I no longer felt I was getting enough value for the money from the MikroTik, and found it easier to just switch back to my own router7 and return the MikroTik.

I will probably stick with the router7 software, but exchange the PC Engines APU with the smallest PC that has enough PCI-E bandwidth for a multi-port SFP28 network card.

Appendix A: Full configuration

# may/28/2021 11:40:15 by RouterOS 6.47.9
# software id = 6YZE-HKM8
#
# model = CCR2004-1G-12S+2XS
/interface bridge
add name=bridge1
/interface ethernet
set [ find default-name=sfp28-1 ] auto-negotiation=no mac-address=00:0d:fa:4c:0c:31
/interface wireless security-profiles
set [ find default=yes ] supplicant-identity=MikroTik
/ip pool
add name=dhcp_pool0 ranges=10.0.0.2-10.0.0.240
/ip dhcp-server
add address-pool=dhcp_pool0 disabled=no interface=bridge1 lease-time=20m name=dhcp1
/interface bridge port
add bridge=bridge1 interface=sfp-sfpplus1
add bridge=bridge1 interface=sfp-sfpplus2
add bridge=bridge1 interface=sfp-sfpplus3
add bridge=bridge1 interface=sfp-sfpplus4
add bridge=bridge1 interface=sfp-sfpplus5
add bridge=bridge1 interface=sfp-sfpplus6
add bridge=bridge1 interface=sfp-sfpplus7
add bridge=bridge1 interface=sfp-sfpplus8
add bridge=bridge1 interface=sfp-sfpplus9
add bridge=bridge1 interface=sfp-sfpplus10
add bridge=bridge1 interface=sfp-sfpplus11
add bridge=bridge1 interface=sfp-sfpplus12
/ip neighbor discovery-settings
set discover-interface-list=none
/ip address
add address=192.168.88.1/24 comment=defconf interface=ether1 network=192.168.88.0
add address=10.0.0.1/24 interface=bridge1 network=10.0.0.0
/ip dhcp-client
add disabled=no interface=sfp28-1 use-peer-dns=no
/ip dhcp-server lease
add address=10.0.0.54 mac-address=DC:A6:32:02:AA:10
/ip dhcp-server network
add address=10.0.0.0/24 dns-server=10.0.0.1 domain=lan gateway=10.0.0.1
/ip dns
set allow-remote-requests=yes servers=8.8.8.8,8.8.4.4,2001:4860:4860::8888,2001:4860:4860::8844
/ip firewall nat
add action=masquerade chain=srcnat out-interface=sfp28-1 to-addresses=0.0.0.0
/ip firewall service-port
set ftp disabled=yes
set tftp disabled=yes
set irc disabled=yes
set h323 disabled=yes
set sip disabled=yes
set pptp disabled=yes
set udplite disabled=yes
set dccp disabled=yes
set sctp disabled=yes
/ip service
set telnet disabled=yes
set ftp disabled=yes
set www disabled=yes
set www-ssl certificate=webfig disabled=no
set api disabled=yes
set winbox disabled=yes
set api-ssl disabled=yes
/ip ssh
set strong-crypto=yes
/ipv6 address
add address=::1 from-pool=fiber7 interface=bridge1
/ipv6 dhcp-client
add add-default-route=yes interface=sfp28-1 pool-name=fiber7 request=address,prefix use-peer-dns=no
/ipv6 nd
add interface=bridge1 managed-address-configuration=yes other-configuration=yes
/system clock
set time-zone-name=Europe/Zurich
/system logging
add topics=dhcp
/tool bandwidth-server
set enabled=no
/tool mac-server
set allowed-interface-list=none
/tool mac-server mac-winbox
set allowed-interface-list=none
/tool mac-server ping
set enabled=no

at 2021-05-28 12:57

2021-05-20

RaumZeitLabor

GnOnlinePN – Workadventure Edition

Hey, you aioli lovers!

The best-smelling event of the year is coming up: on June 5, things get garlicious at GnOnlinePN21!

This year we want to meet in Workadventure from 8 pm and pay homage together to the white gold of bulbs. The whole thing will be accompanied by the sounds of His Serene Highness, RZL resident DJ Asthma!

The digit-ail event also offers plenty of bad-breath-appropriate distance, so you really do not have to hold back when enjoying garlicky food and drinks.

Let us grill garlic bread together in the world, swap recipes and toast with Knoblauchtschunk (garlic Tschunk).

We will share the link to join via Twitter during the afternoon of the event. If you have questions or ideas, you can reach us via mail or Twitter.

See you then!
Smell you later!

Gnoblauchgnollen

by flederrattie at 2021-05-20 00:00

2021-05-16

sECuREs website

Home network 10 Gbit/s upgrade

After adding a fiber link to my home network, I am upgrading that link from 1 Gbit/s to 10 Gbit/s.

As a reminder, conceptually the fiber link is built using two media converters from/to ethernet:

0.9mm thin fiber cables

Schematically, this is what’s connected to both ends:

1 Gbit/s bottleneck

All links are 1 Gbit/s, so it’s easy to see that, for example, transfers between chuchi↔router7 and storage2↔midna cannot both use 1 Gbit/s at the same time.

This upgrade serves 2 purposes:

  1. Raise the floor to 1 Gbit/s end-to-end: Ensure that serving large files (e.g. distri Linux images and packages) no longer impacts, and is no longer impacted by, other bandwidth flows that also use this transfer link in my home network, e.g. daily backups.

  2. Raise the ceiling to 10 Gbit/s: Make it possible to selectively upgrade Linux PCs on either end of the link to 10 Gbit/s peak bandwidth.

Note that the internet uplink remains untouched at 1 Gbit/s — only transfers within the home network can happen at 10 Gbit/s.

Replacing the media converters with Mikrotik switches

We first replace both media converters and switches with a Mikrotik CRS305-1G-4S+IN.

Mikrotik CRS305-1G-4S+IN

This device costs 149 CHF on digitec and comes with 5 ports:

  • 1 × RJ45 Ethernet port for management, can be used as a regular 1 Gbit/s port.
  • 4 × SFP+ ports

Each SFP+ port can be used with either an RJ-45 Ethernet or a fiber SFP+ module, but beware! As Nexus2kSwiss points out on twitter, the Mikrotik supports at most 2 RJ-45 SFPs at a time!

Fiber module upgrade

I’m using 10 Gbit/s fiber SFP+ modules for the fiber link between my kitchen and living room.

To make use of the 10 Gbit/s link between the switches, all devices that should get their guaranteed 1 Gbit/s end-to-end connection need to be connected directly to a Mikrotik switch.

I’m connecting the PCs to the switch using Direct Attach Cables (DAC) where possible. The advantage of DAC cables over RJ45 SFP+ modules is their lower power usage and heat.

The resulting list of SFP modules used in the two Mikrotik switches looks like so:

Mikrotik 1 SFP  speed          speed          Mikrotik 2 SFP
chuchi          10 Gbit/s DAC  10 Gbit/s DAC  midna
storage2        1 Gbit/s RJ45  1 Gbit/s RJ45  router7
10 Gbit/s BiDi  ⬅ BiDi fiber link ➡           10 Gbit/s BiDi

Hardware sourcing

The total cost of this upgrade is 676 CHF, with the biggest chunk spent on the Mellanox ConnectX-3 network cards and MikroTik switches.

FS (Fiber Store) order

FS.COM was my go-to source for anything fiber-related. Everything they have is very affordable, and products in stock at their German warehouse arrive in Switzerland (and presumably other European countries, too) within the same week.

num  price   name
1 ×  34 CHF  Generic Compatible 10GBASE-BX BiDi SFP+ 1270nm-TX/1330nm-RX 10km DOM Transceiver Module, FS P/N: SFP-10G-BX #74681
1 ×  34 CHF  Generic Compatible 10GBASE-BX BiDi SFP+ 1330nm-TX/1270nm-RX 10km DOM Transceiver Module, FS P/N: SFP-10G-BX #74682
2 ×  14 CHF  3m Generic Compatible 10G SFP+ Passive Direct Attach Copper Twinax Cable
0 ×  56 CHF  Generic Compatible 10GBASE-T SFP+ Copper RJ-45 30m Transceiver Module, FS P/N: SFP-10G-T #74680

digitec order

There are a few items that FS.COM doesn’t stock. These I bought at digitec, a big and popular electronics store in Switzerland. My thinking is that if products are available at digitec, they most likely are available at your preferred big electronics store, too.

num price name
2 × 149 CHF Mikrotik CRS305-1G-4S+IN switch

misc order

The Mellanox cards are not as widely available as I’d like.

I’m waiting for an FS.COM card to arrive, which might be a better choice.

num price name
2 × 129 EUR Mellanox ConnectX-3 MCX311A-XCAT
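The three order tables can be cross-checked against the 676 CHF total stated earlier. This Python sketch sums the CHF and EUR items separately and derives the EUR to CHF rate that the total implies. The rate is not stated anywhere in this article, so treat it as an inferred value that rests on the assumption that the total includes the converted price of the Mellanox cards.

```python
# (quantity, unit price) pairs copied from the order tables.
CHF_ITEMS = [
    (1, 34),   # BiDi SFP+ module, 1270nm-TX/1330nm-RX
    (1, 34),   # BiDi SFP+ module, 1330nm-TX/1270nm-RX
    (2, 14),   # 3m direct attach cables
    (0, 56),   # 10GBASE-T SFP+ module, listed with quantity 0
    (2, 149),  # Mikrotik CRS305-1G-4S+IN switches
]
EUR_ITEMS = [
    (2, 129),  # Mellanox ConnectX-3 cards
]
STATED_TOTAL_CHF = 676


def subtotal(items):
    """Sum quantity times unit price over all items."""
    return sum(quantity * price for quantity, price in items)


chf = subtotal(CHF_ITEMS)
eur = subtotal(EUR_ITEMS)
# Exchange rate that makes both subtotals add up to the stated total.
implied_rate = (STATED_TOTAL_CHF - chf) / eur

print(f"CHF items: {chf} CHF")
print(f"EUR items: {eur} EUR")
print(f"implied EUR to CHF rate: {implied_rate:.3f}")
```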

Mikrotik switch setup

I want to use my switches only as switches, not for any routing or other layer 3 features that might reduce bandwidth, so I first reboot the MikroTik CRS305-1G-4S+ into SwOS:

  1. In the web interface menu, navigate to System → Routerboard → Settings, open the Boot OS drop-down and select option SwOS.

  2. In the web interface menu, navigate to System → Reboot.

  3. After the device has rebooted, change the hostname, which was reset to MikroTik.

Next, upgrade the firmware to 2.12 to fix a weird issue with certain combinations of SFP modules (SFP-10G-BX in SFP1, SFP-10G-T in SFP2):

  1. In the SwOS web interface, select the Upgrade tab, then click Download & Upgrade.

Network card setup (Linux)

After booting with the Mellanox ConnectX-3 in a PCIe slot, the card should show up in dmesg(8):

mlx4_core: Mellanox ConnectX core driver v4.0-0
mlx4_core: Initializing 0000:03:00.0
mlx4_core 0000:03:00.0: DMFS high rate steer mode is: disabled performance optimized steering
mlx4_core 0000:03:00.0: 31.504 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x4 link)
mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
mlx4_en 0000:03:00.0: Activating port:1
mlx4_en: 0000:03:00.0: Port 1: Using 16 TX rings
mlx4_en: 0000:03:00.0: Port 1: Using 16 RX rings
mlx4_en: 0000:03:00.0: Port 1: Initializing port
mlx4_en 0000:03:00.0: registered PHC clock
mlx4_core 0000:03:00.0 enp3s0: renamed from eth0
<mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
<mlx4_ib> mlx4_ib_add: counter index 1 for port 1 allocated 1
mlx4_en: enp3s0: Steering Mode 1
mlx4_en: enp3s0: Link Up
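The 31.504 Gb/s figure in the dmesg output can be derived from the link parameters: 8 GT/s per lane, 4 lanes, and the 128b/130b line encoding that PCIe 3.0 uses. The Python sketch below does that arithmetic. The integer division is my assumption about where the rounding happens, chosen because it reproduces the printed value; the function name is made up.

```python
def pcie3_bandwidth_mbps(gigatransfers_per_s: int, lanes: int) -> int:
    """Usable bandwidth in Mb/s for a link using 128b/130b encoding."""
    # 128 payload bits are carried in every 130 bits on the wire.
    per_lane_mbps = gigatransfers_per_s * 1000 * 128 // 130
    return per_lane_mbps * lanes


# Link parameters from the dmesg line: 8.0 GT/s PCIe x4.
bandwidth = pcie3_bandwidth_mbps(gigatransfers_per_s=8, lanes=4)
print(f"{bandwidth / 1000:.3f} Gb/s available PCIe bandwidth")
```

Either way, an x4 link at 8 GT/s leaves plenty of headroom for a single 10 Gbit/s port.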

Another way to verify that the device is running at maximum speed on the computer’s PCIe bus is to ensure LnkSta matches LnkCap in the lspci(8) output:

% sudo lspci -vv
03:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
	Subsystem: Mellanox Technologies Device 0055
[…]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
[…]
		LnkCap:	Port #8, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[…]
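Comparing LnkSta with LnkCap by eye works, but the check is also easy to script. Here is a minimal Python sketch that extracts speed and width from both lines. The function names and the regular expression are my own, the sample text is abbreviated from the output above, and the sketch assumes it is fed the output for a single device (for example from lspci -vv -s 03:00.0).

```python
import re

# Matches the 'LnkCap:' and 'LnkSta:' lines, but not 'LnkCap2:' or 'LnkSta2:'.
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+GT/s).*?Width (x\d+)")


def link_status(lspci_output: str) -> dict:
    """Map 'LnkCap' and 'LnkSta' to their (speed, width) tuples."""
    found = {}
    for line in lspci_output.splitlines():
        match = LINK_RE.search(line)
        if match:
            found[match.group(1)] = (match.group(2), match.group(3))
    return found


def runs_at_full_speed(lspci_output: str) -> bool:
    """True if the negotiated link matches what the device is capable of."""
    links = link_status(lspci_output)
    return "LnkCap" in links and links.get("LnkCap") == links.get("LnkSta")


SAMPLE = (
    "\t\tLnkCap:\tPort #8, Speed 8GT/s, Width x4, ASPM L0s\n"
    "\t\tLnkSta:\tSpeed 8GT/s (ok), Width x4 (ok)\n"
)
print(link_status(SAMPLE))
print("full speed:", runs_at_full_speed(SAMPLE))
```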

You can verify your network link is running at 10 Gbit/s using ethtool(8):

% sudo ethtool enp3s0
Settings for enp3s0:
	Supported ports: [ FIBRE ]
	Supported link modes:   1000baseKX/Full
	                        10000baseKR/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: No
	Supported FEC modes: Not reported
	Advertised link modes:  1000baseKX/Full
	                        10000baseKR/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: No
	Advertised FEC modes: Not reported
	Speed: 10000Mb/s
	Duplex: Full
	Auto-negotiation: off
	Port: Direct Attach Copper
	PHYAD: 0
	Transceiver: internal
	Supports Wake-on: d
	Wake-on: d
        Current message level: 0x00000014 (20)
                               link ifdown
	Link detected: yes

Benchmarking batch transfers

As mentioned in the introduction, routing 10 Gbit/s is out of scope in this article. If you’re interested in routing performance, check out Andree Toonk’s post which confirms that Linux can route 10 Gbit/s at line rate.

The following sections cover individual batch transfers of large files, not many small flows.

iperf3 speed test

Out of the box, the speeds that iperf3(1) measures are decent:

chuchi % iperf3 --version
iperf 3.6 (cJSON 1.5.2)
Linux chuchi 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
Optional features available: CPU affinity setting, IPv6 flow label, SCTP, TCP congestion algorithm setting, sendfile / zerocopy, socket pacing, authentication

chuchi % iperf3 --server
[…]

midna % iperf3 --version          
iperf 3.9 (cJSON 1.7.13)
Linux midna 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64
Optional features available: CPU affinity setting, IPv6 flow label, TCP congestion algorithm setting, sendfile / zerocopy, socket pacing, authentication

midna % iperf3 --client chuchi.lan
Connecting to host 10.0.0.173, port 5201
[  5] local 10.0.0.76 port 43168 connected to 10.0.0.173 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.10 GBytes  9.42 Gbits/sec    0   1.62 MBytes       
[  5]   1.00-2.00   sec  1.09 GBytes  9.41 Gbits/sec    0   1.70 MBytes       
[  5]   2.00-3.00   sec  1.10 GBytes  9.41 Gbits/sec    0   1.70 MBytes       
[  5]   3.00-4.00   sec  1.09 GBytes  9.41 Gbits/sec    0   1.78 MBytes       
[  5]   4.00-5.00   sec  1.09 GBytes  9.41 Gbits/sec    0   1.87 MBytes       
[  5]   5.00-6.00   sec  1.10 GBytes  9.42 Gbits/sec    0   1.87 MBytes       
[  5]   6.00-7.00   sec  1.10 GBytes  9.42 Gbits/sec    0   1.87 MBytes       
[  5]   7.00-8.00   sec  1.10 GBytes  9.41 Gbits/sec    0   1.87 MBytes       
[  5]   8.00-9.00   sec  1.09 GBytes  9.41 Gbits/sec    0   1.96 MBytes       
[  5]   9.00-10.00  sec  1.09 GBytes  9.38 Gbits/sec  402   1.52 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  11.0 GBytes  9.41 Gbits/sec  402             sender
[  5]   0.00-10.00  sec  11.0 GBytes  9.40 Gbits/sec                  receiver

iperf Done.
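As a plausibility check on the 9.41 Gbit/s figure, here is a minimal sketch of the theoretical TCP payload rate on 10 Gbit/s Ethernet. It assumes a standard 1500-byte MTU (no jumbo frames), IPv4, and TCP timestamps being enabled (the Linux default); none of these were stated above, and the function name is made up for illustration.

```c
/* Per-segment byte counts. All values are assumptions about the setup. */
enum {
    MTU_BYTES          = 1500, /* standard Ethernet MTU, no jumbo frames */
    IP_HEADER_BYTES    = 20,   /* IPv4 header without options */
    TCP_HEADER_BYTES   = 32,   /* 20 bytes + 12 bytes for the timestamp option */
    ETH_OVERHEAD_BYTES = 38    /* preamble 8 + header 14 + FCS 4 + inter-frame gap 12 */
};

/* Returns the maximum TCP payload rate in Gbit/s for the given line rate. */
double tcp_goodput_gbps(double line_rate_gbps) {
    const double payload = MTU_BYTES - IP_HEADER_BYTES - TCP_HEADER_BYTES; /* 1448 */
    const double on_wire = MTU_BYTES + ETH_OVERHEAD_BYTES;                 /* 1538 */
    return line_rate_gbps * payload / on_wire;
}
```

Under these assumptions, tcp_goodput_gbps(10.0) comes out at roughly 9.41 Gbit/s, so the measured bitrate is essentially line rate.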

HTTP speed test

Downloading a file from an nginx(1) web server using curl(1) is fast, too:

% curl -o /dev/null http://chuchi.lan/distri/supersilverhaze/img/distri-disk.img.zst
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  934M  100  934M    0     0  1118M      0 --:--:-- --:--:-- --:--:-- 1117M

Note that this download was served from RAM (Linux page cache). The next upgrade I need to do in this machine is replace the SATA SSD with an NVMe SSD, because the disk is now the bottleneck.
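To compare the curl result with the iperf3 numbers, the average download speed needs to be converted into Gbit/s. The following sketch assumes that curl's M suffix means 1024 × 1024 bytes per second; the function name is hypothetical.

```c
/* Converts a transfer rate from MiB/s (as assumed for curl's "M" suffix)
 * into Gbit/s (10^9 bits per second, as used by iperf3). */
double mib_per_s_to_gbit_per_s(double mib_per_s) {
    const double bytes_per_mib = 1024.0 * 1024.0;
    return mib_per_s * bytes_per_mib * 8.0 / 1e9;
}
```

With that assumption, 1118M corresponds to roughly 9.38 Gbit/s, which is in the same range as the iperf3 measurement.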

Conclusion

This was a pleasantly simple upgrade: plug in a bunch of new hardware and batch transfers become faster.

The Mikrotik switch provides great value for money, and the Mellanox ConnectX-3 cards work well, provided you can find them.

Appendix A: Switching from RJ45 SFP+ modules to Direct Attach Cables

Originally, I connected all PCs to the MikroTik switches with RJ45 SFP+ modules for two reasons:

  1. I bought Intel X550-T2 PCIe 10 Gbit/s network cards, which use RJ45, as my first choice.
  2. The SFP+ modules are backwards-compatible and can be used with 1 Gbit/s RJ45 devices, too, which makes for a nice incremental upgrade path.

However, I later was made aware that the RJ45 SFP+ modules use significantly more power and run significantly hotter than Direct Attach Cables (DAC).

I measured it: each RJ45 SFP+ module was causing my BiDi SFP+ module to run 5℃ hotter!

Around 06/02 I replaced one RJ45 SFP+ module with a Direct Attach Cable.

Around 06/06 I replaced the remaining RJ45 SFP+ module with another Direct Attach Cable.

As you can see, this caused a 10℃ drop in temperature of the BiDi SFP+ module.

The MikroTik is still uncomfortably hot, making it hard to work with when it’s powered on.

Appendix B: Network card setup (Linux) with Intel X550-T2

For reference, here is the Network card setup (Linux) section, but with the Intel X550-T2 that I previously used.

After booting with the Intel X550-T2 in a PCIe slot, the card should show up in dmesg(8) :

ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver
ixgbe 0000:03:00.0: Multiqueue Enabled: Rx Queue count = 16, Tx Queue count = 16 XDP Queue count = 0
ixgbe 0000:03:00.0: 31.504 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x4 link)
ixgbe 0000:03:00.0: MAC: 4, PHY: 0, PBA No: H86377-006
ixgbe 0000:03:00.0: Intel(R) 10 Gigabit Network Connection
libphy: ixgbe-mdio: probed
ixgbe 0000:03:00.1: Multiqueue Enabled: Rx Queue count = 16, Tx Queue count = 16 XDP Queue count = 0
ixgbe 0000:03:00.1: 31.504 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x4 link)
ixgbe 0000:03:00.1: MAC: 4, PHY: 0, PBA No: H86377-006
tun: Universal TUN/TAP device driver, 1.6
ixgbe 0000:03:00.1: Intel(R) 10 Gigabit Network Connection
libphy: ixgbe-mdio: probed
ixgbe 0000:03:00.0 enp3s0f0: renamed from eth0
ixgbe 0000:03:00.1 enp3s0f1: renamed from eth1
pps pps0: new PPS source ptp1
ixgbe 0000:03:00.0: registered PHC device on enp3s0f0
pps pps1: new PPS source ptp2
ixgbe 0000:03:00.1: registered PHC device on enp3s0f1

I think if you only use 1 of the card’s 2 network ports, you might not hit any bottlenecks even when running the card only at PCIe 3.0 ×2 link speed, but I haven’t verified this!
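The bandwidth figure in the dmesg output can be approximated from the link parameters. This is a rough sketch, assuming PCIe 3.0 with 8 GT/s per lane and 128b/130b encoding; it ignores PCIe protocol overhead such as packet headers, so the usable bandwidth is lower. The function name is made up for illustration.

```c
/* Approximate raw PCIe 3.0 link bandwidth in Gbit/s for a given lane count.
 * Assumes 8 GT/s per lane and 128b/130b line encoding. */
double pcie3_bandwidth_gbps(int lanes) {
    const double gigatransfers_per_s = 8.0;
    const double encoding_efficiency = 128.0 / 130.0;
    return gigatransfers_per_s * encoding_efficiency * lanes;
}
```

For 4 lanes this gives about 31.5 Gbit/s, close to what the driver logs. For 2 lanes it gives about 15.75 Gbit/s, which is still above the 10 Gbit/s of a single network port, so the guess above seems plausible, but it remains unverified.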

Another way to verify that the device is running at maximum speed on the computer’s PCIe bus is to ensure LnkSta matches LnkCap in the lspci(8) output:

% sudo lspci -vv
[…]
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X550-T2
[…]
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
[…]
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <2us, L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[…]

You can verify your network link is running at 10 Gbit/s using ethtool(8) :

% sudo ethtool enp3s0f1 
Settings for enp3s0f1:
	Supported ports: [ TP ]
	Supported link modes:   100baseT/Full
	                        1000baseT/Full
	                        10000baseT/Full
	                        2500baseT/Full
	                        5000baseT/Full
	Supported pause frame use: Symmetric
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  100baseT/Full
	                        1000baseT/Full
	                        10000baseT/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Speed: 10000Mb/s
	Duplex: Full
	Auto-negotiation: on
	Port: Twisted Pair
	PHYAD: 0
	Transceiver: internal
	MDI-X: Unknown
	Supports Wake-on: d
	Wake-on: d
        Current message level: 0x00000007 (7)
                               drv probe link
	Link detected: yes

Appendix C: BIOS update for Mellanox ConnectX-3

On my Supermicro X11SSZ-QF mainboard, the Mellanox ConnectX-3 would not establish a link. The Mellanox Linux kernel driver logged a number of errors:

kernel: mlx4_en: enp1s0: CQE error - cqn 0x8e, ci 0x0, vendor syndrome: 0x57 syndrome: 0x4
kernel: mlx4_en: enp1s0: Related WQE - qpn 0x20d, wqe index 0x0, wqe size 0x40
kernel: mlx4_en: enp1s0: Scheduling port restart
kernel: mlx4_core 0000:01:00.0: Internal error detected:
kernel: mlx4_core 0000:01:00.0: device is going to be reset
kernel: mlx4_core 0000:01:00.0: crdump: devlink snapshot disabled, skipping
kernel: mlx4_core 0000:01:00.0: device was reset successfully
kernel: mlx4_en 0000:01:00.0: Internal error detected, restarting device
kernel: <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started
kernel: <mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended
kernel: mlx4_core 0000:01:00.0: command 0x21 failed: fw status = 0x1
kernel: pcieport 0000:00:1c.0: AER: Uncorrected (Fatal) error received: 0000:00:1c.0
kernel: pcieport 0000:00:1c.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
kernel: mlx4_core 0000:01:00.0: command 0x43 failed: fw status = 0x1
kernel: infiniband mlx4_0: ib_query_port failed (-5)
kernel: pcieport 0000:00:1c.0:   device [8086:a110] error status/mask=00040000/00010000
kernel: pcieport 0000:00:1c.0:    [18] MalfTLP                (First)
kernel: pcieport 0000:00:1c.0: AER:   TLP Header: 4a000001 01000004 00000000 00000000
kernel: mlx4_core 0000:01:00.0: mlx4_pci_err_detected was called
kernel: mlx4_core 0000:01:00.0: Fail to set mac in port 1 during unregister
systemd-networkd[313]: enp1s0: Link DOWN
kernel: mlx4_en: enp1s0: Failed activating Rx CQ
kernel: mlx4_en: enp1s0: Failed restarting port 1
kernel: mlx4_en: enp1s0: Link Down
kernel: mlx4_en: enp1s0: Close port called
systemd-networkd[313]: enp1s0: Lost carrier
kernel: mlx4_en 0000:01:00.0: removed PHC
kernel: mlx4_core 0000:01:00.0: mlx4_restart_one_up: ERROR: mlx4_load_one failed, pci_name=0000:01:00.0, err=-5
kernel: mlx4_core 0000:01:00.0: mlx4_restart_one was ended, ret=-5
systemd-networkd[313]: enp1s0: DHCPv6 lease lost
kernel: pcieport 0000:00:1c.0: AER: Root Port link has been reset
kernel: mlx4_core 0000:01:00.0: mlx4_pci_resume was called
kernel: mlx4_core 0000:01:00.0: Multiple PFs not yet supported - Skipping PF
kernel: mlx4_core 0000:01:00.0: mlx4_pci_resume: mlx4_load_one failed, err=-22
kernel: pcieport 0000:00:1c.0: AER: device recovery successful

What helped was to update the X11SSZ-QF BIOS to the latest version.

at 2021-05-16 15:33

2021-05-11

michael-herbst.com

Infomath seminar: A one-hour introduction to Julia

Just one day after my talk at the Lüchow group in Aachen, on 6th May I was asked to give a short introduction to Julia at the Infomath seminar series at Sorbonne Université. While virtual seminars certainly don't share the same spirit as in-person ones do, the ability to quickly hop between seminar series organised all across the world has advantages, too.

In my one-hour talk I gave a short introduction to Julia, focusing on the perspective of applied mathematicians. I gave a short speed comparison of Julia, Python and C on a simple example and presented some of its strengths in different application scenarios (numerical linear algebra, numerical methods for solving PDEs, data science and statistical learning). For future reference I have put the Jupyter notebooks I used during the lecture on GitHub.

Link
An introductory hour to the Julia programming language (Github repository)

by Michael F. Herbst at 2021-05-11 10:01 under Research, talk, Julia, programming and scripting

2021-05-11

michael-herbst.com

Talk at Lüchow group seminar at RWTH

On 5th May I was invited to present a short summary of my research at the local theoretical chemistry research group of Prof. Dr. Arne Lüchow at RWTH Aachen. Because I wanted to give a broad overview of topics that I worked on over the past few years, I did not really go into many details. Nevertheless my talk led to interesting and lively discussions, which I enjoyed very much. Clearly the pandemic makes it difficult to get in touch with other researchers, so I was pretty happy to be able to get in touch with some more chemists from the RWTH during the seminar.

Link
High-throughput electronic-structure simulations: Where reliability really matters (Slides)

by Michael F. Herbst at 2021-05-11 10:00 under Research, talk, electronic structure theory, Kohn-Sham, high-throughput, DFT, solid state

2021-05-08

sECuREs website

Measure and reduce keyboard input latency with QMK on the Kinesis Advantage

Over the last few years, I worked on a few projects around keyboard input latency:

In 2018, I introduced the kinX keyboard controller with 0.2ms of input latency.

In 2020, I introduced the kinT keyboard controller, which works with a wide range of Teensy micro controllers, and both the old KB500 and the newer KB600 Kinesis Advantage models.

While the 2018 kinX controller had built-in latency measurement, I was starting from scratch with the kinT design, where I wanted to use the QMK keyboard firmware instead of my own firmware.

That got me thinking: instead of adjusting the firmware to self-report latency numbers, is there a way we can do latency measurements externally, ideally without software changes?

This article walks you through how to set up a measurement environment for your keyboard controller’s input latency, be it original or self-built. I’ll use a Kinesis Advantage keyboard, but this approach should generalize to all keyboards.

I will explain a few common causes for extra keyboard input latency and show you how to fix them in the QMK keyboard firmware.

Measurement setup

The idea is to connect a Teensy 4.0 (or similar), which simulates pressing the Caps Lock key and measures the duration until the keypress resulted in a Caps Lock LED change.

We use the Caps Lock key because it is one of the few keys that results in an LED change.

Here you can see the Teensy 4.0 connected to the kinT controller, connected to a laptop:

measurement setup

Enable the debug console in QMK

Let’s get our QMK working copy ready for development! I like to work in a separate QMK working copy per project:

% docker run -it -v $PWD:/usr/src archlinux
# pacman -Sy && pacman -S qmk make which diffutils python-hidapi python-pyusb
# cd /usr/src
# qmk clone -b develop qmk/qmk_firmware $PWD/qmk-input-latency
# cd qmk-input-latency

I compile the firmware for my keyboard like so:

# make kinesis/kint36:stapelberg

To enable the debug console, I need to edit my QMK keymap stapelberg by updating keyboards/kinesis/keymaps/stapelberg/rules.mk to contain:

CONSOLE_ENABLE = yes

After compiling and flashing the firmware, the hid_listen tool will detect the device and listen for QMK debug messages:

% sudo hid_listen
Waiting for device:...
Listening:

Finding the pins

Let’s locate the Caps Lock key’s corresponding row and column in our keyboard matrix!

We can make QMK show which keys are recognized after each scan by adding to keyboards/kinesis/keymaps/stapelberg/keymap.c the following code:

void keyboard_post_init_user() {
  debug_config.enable = true;
  debug_config.matrix = true;
}

Now we’ll see in the hid_listen output which key is active when pressing Caps Lock:

r/c 01234567
00: 00100000
01: 00000000
[…]

For our kinT controller, Caps Lock is on QMK matrix row 0, column 2.

In the kinT schematic, the corresponding signals are ROW_EQL and COL_2.

To hook up the Teensy 4.0 latency measurement driver, I am making the following GPIO connections to the kint36, kint41 or kint2pp (with voltage converter!) keyboard controllers:

driver 4.0   signal          kint36, kint41   kint2pp (5V!)
GND          GND             GND              GND
pin 10       ROW_EQL         pin 8            D7
pin 11       COL_2           pin 15           F7
pin 12       LED_CAPS_LOCK   pin 12           C1

Eager Caps Lock LED

When the host signals to the keyboard that Caps Lock is now turned on, the QMK firmware first updates a flag in the USB interrupt handler, but only updates the Caps Lock LED pin after the next matrix scan has completed.

This is fine in normal usage, but our measurement readings will get more precise if we immediately update the Caps Lock LED pin. We can do this in set_led_transfer_cb in tmk_core/protocol/chibios/usb_main.c, which is called from the USB interrupt handler:

#include "gpio.h"

static void set_led_transfer_cb(USBDriver *usbp) {
    if (usbp->setup[6] == 2) { /* LSB(wLength) */
        uint8_t report_id = set_report_buf[0];
        if ((report_id == REPORT_ID_KEYBOARD) || (report_id == REPORT_ID_NKRO)) {
            keyboard_led_state = set_report_buf[1];
        }
    } else {
        keyboard_led_state = set_report_buf[0];
    }
    if ((keyboard_led_state & 2) != 0) {
      writePinLow(C7); // turn on CAPS_LOCK LED
    } else {
      writePinHigh(C7); // turn off CAPS_LOCK LED
    }
}

Host side (Linux)

On the USB host, i.e. the Linux computer, I switch to a Virtual Terminal (VT) by stopping my login manager (killing my current graphical session!):

% sudo systemctl stop gdm

With the Virtual Terminal active, we know that the Caps Lock key press will be handled entirely in kernel driver code without having to round-trip to userspace.

We can verify this by collecting stack traces with bpftrace(8) when the kernel executes the kbd_event function in drivers/tty/vt:

% sudo bpftrace -e 'kprobe:kbd_event { @[kstack] = count(); }'

After pressing Caps Lock and cancelling the bpftrace process, you should see a stack trace.

I then measured the baseline end-to-end latency, using my measure-fw firmware running on the FRDM-K66F eval kit, a cheap and widely available USB 2.0 High Speed device. The firmware measures the latency between a button press and the USB HID report for the Caps Lock LED, but without any additional matrix scanning delay or similar:

% cat /dev/ttyACM0
sof=74 μs	report=393 μs
sof=42 μs	report=512 μs
sof=19 μs	report=512 μs
sof=39 μs	report=488 μs
sof=20 μs	report=518 μs
sof=90 μs	report=181 μs
sof=42 μs	report=389 μs
sof=7 μs	report=319 μs

This is the quickest reaction we can get out of this computer. Anything on top (e.g. X11, application) will be slower, so this measurement establishes a lower bound.

Code to simulate key presses and take measurements

I’m running the latencydriver Arduino sketch, with the Arduino IDE configured for:

Teensy 4.0 (USB Type: Serial, CPU Speed: 600 MHz, Optimize: Faster)

Here’s how we set up the pins in the measurement driver Teensy 4.0:

void setup() {
  Serial.begin(9600);

  // Connected to kinT pin 15, COL_2
  pinMode(11, OUTPUT);
  digitalWrite(11, HIGH);

  // Connected to kinT pin 8, ROW_EQL.
  // Pin 11 will be high/low in accordance with pin 10
  // to simulate a key-press, and always high (unpressed)
  // otherwise.
  pinMode(10, INPUT_PULLDOWN);
  attachInterrupt(digitalPinToInterrupt(10), onScan, CHANGE);

  // Connected to the kinT LED_CAPS_LOCK output:
  pinMode(12, INPUT_PULLDOWN);
  attachInterrupt(digitalPinToInterrupt(12), onCapsLockLED, CHANGE);
}

In order to make a key read as pressed, we need to connect the column with the row in the keyboard matrix, but only when the column is scanned. We do that in the interrupt handler like so:

bool simulate_press = false;

void onScan() {
  if (simulate_press) {
    // connect row scan signal with column read
    digitalWrite(11, digitalRead(10));
  } else {
    // always read not pressed otherwise
    digitalWrite(11, HIGH);
  }
}

In our text interface, we can now start a measurement like so:

caps_lock_on_to_off = capsLockOn();
Serial.printf("# Caps Lock key pressed (transition: %s)\r\n",
  caps_lock_on_to_off ? "on to off" : "off to on");
simulate_press = true;
t0 = ARM_DWT_CYCCNT;
emt0 = 0;
eut0 = 0;

The next keyboard matrix scan will detect the key as pressed, send the HID report to the OS, and when the OS responds with its HID report containing the Caps Lock LED status, our Caps Lock LED interrupt handler is called to finish the measurement:

void onCapsLockLED() {
  const uint32_t t1 = ARM_DWT_CYCCNT;
  const uint32_t elapsed_millis = emt0;
  const uint32_t elapsed_micros = eut0;
  uint32_t elapsed_nanos = (t1 - t0) / cycles_per_ns;

  Serial.printf("# Caps Lock LED (pin 12) is now %s\r\n", capsLockOn() ? "on" : "off");
  Serial.printf("# %u ms == %u us\r\n", elapsed_millis, elapsed_micros);
  Serial.printf("BenchmarkKeypressToLEDReport 1 %u ns/op\r\n", elapsed_nanos);
  Serial.printf("\r\n");
}
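The conversion from cycle counter readings to nanoseconds can be sketched as follows. The handler above relies on a cycles_per_ns value whose definition is not shown; this sketch instead takes the CPU frequency as a parameter (600 MHz for the Teensy 4.0 as configured above) and uses integer arithmetic. The function name is hypothetical.

```c
#include <stdint.h>

/* Converts two readings of a free-running 32-bit cycle counter into
 * elapsed nanoseconds. The unsigned subtraction yields the correct
 * difference even if the counter wrapped around once between t0 and t1. */
uint64_t cycles_to_ns(uint32_t t0, uint32_t t1, uint32_t f_cpu_hz) {
    const uint32_t cycles = t1 - t0;
    /* Widen to 64 bit before multiplying to avoid overflow. */
    return ((uint64_t)cycles * 1000000000ULL) / f_cpu_hz;
}
```

At 600 MHz, the 32-bit counter wraps roughly every 7 seconds, which is far longer than any keypress measurement in this article, so a single wraparound is the only case that needs handling.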

Running measurements

Connect the Teensy 4.0 to your computer and open its USB serial console:

% screen /dev/ttyACM0 115200

You should be greeted by a welcome message:

# kinT latency measurement driver
#   t  - trigger measurement

To save your measurements to file, use C-a H in screen to make it write to file screenlog.0.

Press t a few times to trigger a few measurements and close screen using C-a k.

You can summarize the measurements using benchstat:

% benchstat screenlog.0
name                 time/op
KeypressToLEDReport  1.82ms ±20%

Scan-to-scan delay

The measurement output on the USB serial console also contains the matrix scan-to-scan delay:

# scan-to-scan delay: 422475 ns

Each keyboard matrix scan turns on each row one-by-one, then reads all the columns.

This means that in each matrix scan, ROW_EQL will be set high once, then low again.

The Teensy 4.0 measures scan-to-scan delay by timing the activations of ROW_EQL.

We can verify this approach by making QMK self-report its scan rate. Enable the matrix scan rate debug option in keyboards/kinesis/keymaps/stapelberg/config.h like so:

#pragma once

#define DEBUG_MATRIX_SCAN_RATE

Using hid_listen we can now see the following QMK debug messages:

% sudo hid_listen
Waiting for new device:..
Listening:
matrix scan frequency: 2300
matrix scan frequency: 2367
matrix scan frequency: 2367

A matrix scan rate/frequency of 2367 scans per second corresponds to 422μs per scan:

1000000 μs / 2367 scans/second = 422μs

Yet another way of verifying the approach is by short-circuiting an end-to-end measurement with a one-line change in our QMK keyboard code:

bool process_action_kb(keyrecord_t *record) {
#define LED_CAPS_LOCK LINE_PIN12
#define ledTurnOn writePinLow
  ledTurnOn(LED_CAPS_LOCK);
  return true;
}

Repeating the measurements, this gives us:

% benchstat screenlog.0     
name                 time/op
KeypressToLEDReport  693µs ±26%

This value lies in the range [0, 2 × 422μs] because a key might be pressed after it was already scanned by the in-progress matrix scan, meaning it will need to wait until the next scan has completed (!) before it can be registered as pressed.
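The range can be illustrated with a simplified timing model. This is only a sketch under assumptions: it treats a scan as reading the key's row at one fixed offset within the scan period and registering the key once that scan completes. It is not how QMK is implemented, and the function and parameter names are made up.

```c
#include <stdint.h>

/* Simplified model: returns the time in μs from a key press until the end
 * of the matrix scan that observes it.
 *
 * press_us:       when the key is pressed, relative to the start of the
 *                 in-progress scan (0..scan_period_us)
 * row_read_us:    when the key's row is read within each scan
 *                 (0..scan_period_us)
 * scan_period_us: scan-to-scan delay */
uint32_t detection_delay_us(uint32_t press_us, uint32_t row_read_us,
                            uint32_t scan_period_us) {
    if (press_us <= row_read_us) {
        /* The in-progress scan still sees the key. */
        return scan_period_us - press_us;
    }
    /* The row was already read: finish this scan, then wait for the
     * next scan to complete. */
    return (scan_period_us - press_us) + scan_period_us;
}
```

In this model, a key pressed just after its row was read at the very start of a scan waits for almost two full scan periods, whereas a key whose row is read last and which is pressed just before that read waits for almost no time at all.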

Measurement harness

Now that we have our general measurement environment all set up, it’s time to connect our Teensy 4.0 to a few different keyboard controllers!

kint36, kint41: GPIO

If you have an un-soldered micro controller you want to measure, setup is easy: just connect all GPIOs to the Teensy 4.0 latency test driver directly! I’m using this for the kint36 and kint41:

GPIO measurement

(build in /home/michael/kinx/kintpp/rebased, last results in screenlog-kint36-eager-caps.0)

kint2pp: 5V

Because the Teensy++ uses 5V logic levels, we need to convert the levels from/to 3.3V. This is easily done using e.g. the SparkFun Logic Level Converter (Bi-Directional) on a breadboard:

kint2pp with level shifter

kinX: FPC

But what if you have a design where the micro controller doesn’t come standalone, only soldered to a keyboard controller board, such as my earlier kinX controller?

You can use a spare FPC connector (Molex 39-53-2135) and solder jumper wires to the pins for COL_2 and ROW_EQL. For Caps Lock and Ground, I soldered jumper wires to the board:

kinX measurement

Original Kinesis controller

But what if you don’t want to solder jumper wires directly to the board?

The least invasive method is to connect the FPC connector break-out, and hold probe heads onto the contacts while doing your measurements:

kinesis original controller measurement

QMK input latency

Now that the measurement hardware is set up, we can go through the code.

The following sections each cover one possible contributor to input latency.

Eager debounce

Key switches don’t generate a clean signal when pressed; instead, they show a ripple effect. Getting rid of this ripple is called debouncing, and every keyboard firmware does it.

See QMK’s documentation on the Debounce API for a good explanation of the differences between the different debounce approaches.

QMK’s default debounce algorithm sym_defer_g is chosen very cautiously. I don’t know the specific criteria for which types of key switches suffer from noise and therefore need the sym_defer_g algorithm, but I know that Cherry MX key switches with diodes, like those used in the Kinesis Advantage, don’t have noise and hence can use the other debounce algorithms, too.

While the default sym_defer_g debounce algorithm is robust, it also adds 5ms of input latency:

% benchstat screenlog-kint36.0
name                 time/op
KeypressToLEDReport  7.61ms ± 8%

For lower input latency, we need an eager algorithm. Specifically, I am choosing the sym_eager_pk debounce algorithm by adding to my keyboards/kinesis/kint36/rules.mk:

DEBOUNCE_TYPE = sym_eager_pk

Now, the extra 5ms are gone:

% benchstat screenlog-kint36-eager.0
name                 time/op
KeypressToLEDReport  2.12ms ±16%

Example change: https://github.com/qmk/qmk_firmware/pull/12626
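To illustrate why an eager algorithm saves the debounce time, here is a simplified single-key model of both strategies. This is a sketch for illustration only and not QMK's implementation: the type and function names are hypothetical, and the 5ms window is taken from the measurement above.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_MS 5 /* assumed debounce window */

typedef struct {
    bool     reported;  /* state last reported to the host */
    bool     candidate; /* defer only: raw state waiting to become stable */
    bool     locked;    /* eager only: changes are currently ignored */
    uint32_t since_ms;  /* time of the last relevant change */
} debounce_key_t;

/* Deferred: report a change only after the raw signal has been stable
 * for DEBOUNCE_MS. Robust against noise, but adds DEBOUNCE_MS of latency. */
bool debounce_defer(debounce_key_t *k, bool raw, uint32_t now_ms) {
    if (raw != k->candidate) {
        k->candidate = raw;       /* signal changed: restart the timer */
        k->since_ms  = now_ms;
    } else if (raw != k->reported && (now_ms - k->since_ms) >= DEBOUNCE_MS) {
        k->reported = raw;        /* stable long enough: report it */
    }
    return k->reported;
}

/* Eager: report the first change immediately, then ignore further
 * changes (the ripple) for DEBOUNCE_MS. */
bool debounce_eager(debounce_key_t *k, bool raw, uint32_t now_ms) {
    if (k->locked && (now_ms - k->since_ms) >= DEBOUNCE_MS) {
        k->locked = false;        /* debounce window has expired */
    }
    if (!k->locked && raw != k->reported) {
        k->reported = raw;        /* report right away */
        k->since_ms = now_ms;
        k->locked   = true;
    }
    return k->reported;
}
```

Starting from a zero-initialized debounce_key_t, the eager variant reports a press in the same call that first sees it, while the deferred variant only reports it once the signal has been stable for the full window.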

Quicker USB polling interval

The USB host (computer) divides time into fixed-length segments called frames:

  • USB Full Speed (USB 1.0) uses frames that are 1ms each.
  • USB High Speed (USB 2.0) introduces micro frames, which are 125μs.

Each USB device specifies in its endpoint descriptors how frequently (in frames) each endpoint should be polled. The quickest polling rate for USB 1.0 is 1 frame, meaning the device can send data after at most 1ms. Similarly, for USB 2.0, it’s 1 micro frame, i.e. send data every 125μs.

Of course, a quicker polling rate also means occupying resources on the USB bus which are then no longer available to other devices. On larger USB hubs, this might mean fewer devices can be used concurrently. The specifics of this limitation depend on a lot of other factors, too. The polling rate plays a role, in combination with the max. packet size and the number of endpoints.

Note that we are only talking about concurrent device usage, not about hogging bandwidth: the bulk transfers that USB mass storage devices use are not any slower in my tests. I achieve about 37 MiB/s with or without the kint41 USB 2.0 High Speed controller with bInterval=1 present.

Even connecting two kint41 controllers at the same time still leaves enough resources to use a Logitech C920 webcam in its most bandwidth-intensive pixel format and resolution. The same cannot be said for e.g. NXP’s LPC-Link2 debug probe.

To display the configured interval, the Linux kernel provides a debug pseudo file:

% sudo cat /sys/kernel/debug/usb/devices

[…]
T:  Bus=01 Lev=02 Prnt=09 Port=02 Cnt=02 Dev#= 53 Spd=480  MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=1209 ProdID=345c Rev= 0.01
S:  Manufacturer="https://github.com/stapelberg"
S:  Product="kinT (kint41)"
C:* #Ifs= 3 Cfg#= 1 Atr=a0 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=01 Driver=usbhid
E:  Ad=81(I) Atr=03(Int.) MxPS=   8 Ivl=125us
I:* If#= 1 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=00 Prot=00 Driver=usbhid
E:  Ad=82(I) Atr=03(Int.) MxPS=  32 Ivl=125us
I:* If#= 2 Alt= 0 #EPs= 2 Cls=03(HID  ) Sub=00 Prot=00 Driver=usbhid
E:  Ad=83(I) Atr=03(Int.) MxPS=  32 Ivl=125us
E:  Ad=04(O) Atr=03(Int.) MxPS=  32 Ivl=125us
[…]

Alternatively, you can display the USB device descriptor using e.g. sudo lsusb -v -d 1209:345c and interpret the bInterval setting yourself.

The above shows the best case: a USB 2.0 High Speed device (Spd=480) with bInterval=1 in its endpoint descriptors (Ivl=125us).

The original Kinesis Advantage 2 keyboard controller (KB600) uses USB 2.0, but in Full Speed mode (Spd=12), i.e. no faster than USB 1.1. In addition, they specify bInterval=10, which results in a 10ms polling interval (Ivl=10ms):

T:  Bus=01 Lev=02 Prnt=09 Port=02 Cnt=02 Dev#= 52 Spd=12   MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=29ea ProdID=0102 Rev= 1.00
S:  Manufacturer=Kinesis
S:  Product=Advantage2 Keyboard
C:* #Ifs= 3 Cfg#= 1 Atr=a0 MxPwr=100mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=02 Driver=usbhid
E:  Ad=83(I) Atr=03(Int.) MxPS=   8 Ivl=10ms
I:* If#= 1 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=01 Driver=usbhid
E:  Ad=84(I) Atr=03(Int.) MxPS=   8 Ivl=2ms
I:* If#= 2 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=00 Prot=00 Driver=usbhid
E:  Ad=85(I) Atr=03(Int.) MxPS=   8 Ivl=2ms

My recommendation:

  • With USB 1.1 Full Speed, definitely specify bInterval=1. I’m not aware of any downsides.
  • With USB 2.0 High Speed, I also think bInterval=1 is a good choice, but I am less certain. If you run into trouble, fall back to the less frequent bInterval=3 and send me a message :)
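How bInterval translates into a polling interval depends on the bus speed. The following sketch follows the USB 2.0 rules for interrupt endpoints: at Full Speed, bInterval counts 1ms frames directly, whereas at High Speed the interval is 2^(bInterval-1) micro frames of 125μs each. The type and function names are made up for illustration.

```c
#include <stdint.h>

typedef enum { USB_FULL_SPEED, USB_HIGH_SPEED } usb_speed_t;

/* Returns the polling interval in μs of an interrupt endpoint,
 * or 0 if bInterval is outside the valid range for that speed. */
uint32_t polling_interval_us(usb_speed_t speed, uint8_t b_interval) {
    if (speed == USB_FULL_SPEED) {
        /* Full Speed: bInterval (1..255) is a number of 1 ms frames. */
        if (b_interval < 1) {
            return 0;
        }
        return (uint32_t)b_interval * 1000u;
    }
    /* High Speed: bInterval (1..16) is an exponent,
     * the interval is 2^(bInterval-1) micro frames of 125 μs. */
    if (b_interval < 1 || b_interval > 16) {
        return 0;
    }
    return (1u << (b_interval - 1)) * 125u;
}
```

This matches the kernel output above: bInterval=1 at High Speed gives 125μs, and bInterval=10 at Full Speed gives 10ms. Under the same rules, bInterval=3 at High Speed corresponds to 500μs.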

For details on measuring, see Appendix B: USB polling interval (device side).

Example change: https://github.com/qmk/qmk_firmware/pull/12625

Faster matrix scan

The purpose of a keyboard controller is reporting pressed keys after scanning the key matrix. The more scans a keyboard controller can do per second, the faster it can react to your key press.

How many scans your controller does depends on multiple factors:

  • The clock speed of your micro controller. It’s worth checking if your micro controller model supports running at faster clock speeds, or upgrading your keyboard to a faster model to begin with. There is a point of diminishing returns, which I would guess is at ≈100 MHz. Comparing e.g. the kint36 at 120 MHz vs. 180 MHz, the difference in scan-to-scan is 5μs.

  • How much other code your firmware runs aside from matrix scanning. If you enable any non-standard QMK features, or even self-written code, it’s worth disabling and measuring.

  • Whether you run scans back-to-back or e.g. synchronized with USB start-of-frame interrupts. QMK runs scans back-to-back, so this point is only relevant for other firmwares.

  • How long you need to sleep to let the signal settle. Reducing your sleep times results in more scans per second, but if you don’t sleep long enough, you’ll see ghost key presses. See also the next section about Shorter sleeps.

For details on measuring, see the Scan-to-scan delay section above.

I also tried configuring the GPIOs to be faster to see if that would reduce the required unselect delay, but unfortunately there was no difference between the default setting and the fastest setting: drive strength 6 (DSE=6), fast slew rate (SRE=1), 200 MHz (SPEED=3).

Shorter sleeps

QMK calls ChibiOS’s chThdSleepMicroseconds function in its matrix scanning code. This function unfortunately has a rather long shortest sleep duration of 1 ChibiOS tick: if you tell it to sleep less than 100μs, it will still sleep at least 100μs!

This is a problem on controllers such as the kint41, where we want to sleep for only 10μs.

The length of a ChibiOS tick is determined by how the ARM SysTick timer is set up on the specific micro controller you’re using. While the SysTick timer itself could be configured to fire more frequently, it is not advisable to shorten ChibiOS ticks: chSysTimerHandlerI() must be executable in less than one tick.
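The effect of the tick granularity can be modeled in a few lines. This is a simplified sketch and not ChibiOS code: it assumes a tick frequency of 10 kHz (derived from the 100μs figure above) and assumes that requested durations are rounded up to whole ticks, with a minimum of one tick. All names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 10 kHz system tick, i.e. one tick lasts 100 μs. */
#define TICK_FREQUENCY_HZ 10000u

/* Converts microseconds to ticks, rounding up to whole ticks. */
uint32_t us_to_ticks(uint32_t us) {
    return (uint32_t)(((uint64_t)us * TICK_FREQUENCY_HZ + 999999u) / 1000000u);
}

/* Models the shortest duration a tick-based sleep can actually take. */
uint32_t effective_sleep_us(uint32_t requested_us) {
    uint32_t ticks = us_to_ticks(requested_us);
    if (ticks == 0) {
        ticks = 1; /* never shorter than one tick in this model */
    }
    return ticks * (1000000u / TICK_FREQUENCY_HZ);
}
```

In this model, a requested sleep of 10μs turns into at least 100μs, ten times longer than intended.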

Instead, I found it easier to implement short delays by busy-looping until the ARM Cycle Counter Register (CYCCNT) indicates enough time has passed. Here’s an example from keyboards/kinesis/kint41/kint41.c:

// delay_inline sleeps for |cycles| (e.g. sleeping for F_CPU will sleep 1s).
//
// delay_inline assumes the cycle counter has already been initialized and
// should not be modified, i.e. is safe to call during keyboard matrix scan.
//
// ChibiOS enables the cycle counter in chcore_v7m.c.
static void delay_inline(const uint32_t cycles) {
  const uint32_t start = DWT->CYCCNT;
  while ((DWT->CYCCNT - start) < cycles) {
    // busy-loop until time has passed
  }
}

void matrix_output_unselect_delay(void) {
  // 600 cycles at 0.6 cycles/ns == 1μs
  const uint32_t cycles_per_us = 600;
  delay_inline(10 * cycles_per_us);
}

Of course, the cycles/ns value is specific to the frequency at which your micro controller runs, so this code needs to be adjusted for each platform.
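One way to avoid hard-coding the cycle count per platform is to derive it from the CPU frequency. The helper below is a hypothetical sketch and not part of QMK; it assumes the CPU frequency in Hz is known, e.g. via a constant such as F_CPU.

```c
#include <stdint.h>

/* Returns the number of CPU cycles that correspond to the given number of
 * microseconds at the given CPU frequency. */
uint32_t us_to_cycles(uint32_t f_cpu_hz, uint32_t us) {
    /* Widen to 64 bit before multiplying to avoid overflow. */
    return (uint32_t)(((uint64_t)f_cpu_hz * us) / 1000000u);
}
```

For the kint41 at 600 MHz, us_to_cycles(600000000, 10) yields the same 6000 cycles as the hard-coded 10 * cycles_per_us above. At 180 MHz, the same 10μs delay would be 1800 cycles.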

Results

With the QMK keyboard firmware configured for lowest input latency, how do the different Kinesis keyboard controllers compare? Here are my measurements:

model     CPU speed  USB poll interval  scan-to-scan  scan rate     caps-to-report
kint41    600 MHz    125μs              181μs         5456 scans/s  930µs ±17%
kinX      120 MHz    125μs              213μs         4694 scans/s  953µs ±15%
kint36    180 MHz    1000μs             444μs         2252 scans/s  1.97ms ±15%
kint2pp   16 MHz     1000μs             926μs         1078 scans/s  3.27ms ±32%
original  60 MHz     10000μs            1936μs        516 scans/s   13.6ms ±21%

The changes required to obtain these results are included since QMK 0.12.38 (2021-04-20).

kint41 support is still in progress, but it is being added with all required changes from the start.

The following sections go into detail about the results.

kint41

I am glad that the most recent Teensy 4.1 micro controller takes the lead! The kinX controller achieved similar numbers, but was quite difficult to build, so few people ended up using it.

The key improvement compared to the Teensy 3.6 is the now-available USB 2.0 High Speed, and the powerful clock speed of 600 MHz allows for an even faster matrix scan rate.

kinX

In my previous article about the kinX controller, I measured the kinX scan delay as ≈100μs. During my work on this article, I learnt that the ≈100μs figure was misleading: the measurement code turned off interrupts to measure only the scan function. While that is technically correct, it is not a useful measure, as in practice, interrupts should not be disabled, and the scanning function is interrupted frequently enough that it comes in at ≈208μs.

I also fixed the USB polling interval in the kinX firmware, which wasn’t set to bInterval=1.

Original Kinesis

The original keyboard controller that the Kinesis Advantage 2 (KB600) keyboard comes with uses an AT32UC3B0256 micro controller which is clocked at 60 MHz, but the measured input latency is much higher than even the slowest kint controller (kint2pp at 16 MHz). What gives?

Here’s what we can deduce without access to their firmware:

  1. They seem to be using an eager debounce algorithm (good!), otherwise we would observe even higher latency.
  2. Their USB polling interval setting (bInterval=10) is excessively high, even more so because they are using USB Full Speed with longer USB frames. I would recommend they change it to bInterval=1 for up to 10ms less input latency!
  3. The matrix scan rate is twice as slow as with my kint2pp. I can’t say for sure why this is. Perhaps their firmware does a lot of other things between matrix scans.
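
The effect of the bInterval setting can be made concrete with a small calculation. The sketch below follows the USB 2.0 rules for interrupt endpoints as I understand them: at Full Speed, bInterval counts 1ms frames; at High Speed, the period is 2^(bInterval-1) microframes of 125μs each. The type and function names are illustrative and do not come from any firmware.

```c
#include <assert.h>
#include <stdint.h>

// Illustrative names, not taken from QMK or ChibiOS.
typedef enum { USB_FULL_SPEED, USB_HIGH_SPEED } usb_speed_t;

// Returns the polling period of an interrupt endpoint in microseconds.
static inline uint32_t poll_interval_us(usb_speed_t speed, uint8_t b_interval) {
  assert(b_interval >= 1);
  if (speed == USB_HIGH_SPEED) {
    assert(b_interval <= 16);  // valid High Speed range is 1..16
    // Period is 2^(bInterval-1) microframes of 125μs each.
    return 125u << (b_interval - 1);
  }
  // Full Speed: bInterval is a count of 1ms frames.
  return 1000u * b_interval;
}
```

This reproduces the "USB poll interval" column of the results: 125μs for High Speed with bInterval=1, 1000μs for Full Speed with bInterval=1, and 10000μs for Full Speed with bInterval=10.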

Note that we could not apply the Eager Caps Lock LED firmware change to the original controller, which is why the measurement variance is ±21%. This variance includes ±1.9ms for finishing a matrix scan before updating the LED state.

Conclusion

After analyzing the different controllers in my measurement environment, I think the following factors play the largest role in keyboard input latency, ordered by importance:

  1. Does the firmware use an eager debounce algorithm?
  2. Does the device specify a quick USB polling rate (bInterval setting)?
  3. Is the matrix scan frequency in the expected range, or are there unexpected slow-downs?
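
To illustrate the first factor, here is a minimal sketch of the general idea behind eager debouncing: report a state change immediately, then ignore further changes of that key until the debounce time has passed. This is not QMK's implementation; the struct, the function name and the per-key bookkeeping are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical per-key state for this sketch.
typedef struct {
  bool state;          // last reported (debounced) state
  bool locked;         // true while further changes are ignored
  uint32_t locked_at;  // timestamp in ms of the last reported change
} eager_key_t;

// Feeds one raw matrix reading into the debouncer, returns the state to report.
static inline bool eager_debounce(eager_key_t *key, bool raw, uint32_t now_ms,
                                  uint32_t debounce_ms) {
  assert(key != NULL);
  // While locked, contact bounce is ignored. Unsigned subtraction keeps the
  // comparison correct across timer wrap-around.
  if (key->locked && (uint32_t)(now_ms - key->locked_at) < debounce_ms) {
    return key->state;
  }
  key->locked = false;
  if (raw != key->state) {
    // Eager: the change is reported right away, the waiting happens afterwards.
    key->state = raw;
    key->locked = true;
    key->locked_at = now_ms;
  }
  return key->state;
}
```

The design choice is that the debounce time is paid after the key event is reported rather than before, so it does not add to the input latency of the first transition.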

Hopefully, this article gives you all the tools you need to measure and reduce keyboard input latency of your own keyboard controller!

Appendix A: isitsnappy

The iPhone app Is It Snappy? records video using the iPhone’s 240 fps camera and allows you to mark the frames that start and end the measurement.

The app does a good job of making this otherwise tedious process of navigating a video frame by frame much more pleasant.

However, for measuring keyboard input latency, I think this approach is futile:

  • The resolution is too coarse. At 240 fps, each frame represents ≈4.2ms of time (1000ms / 240), which is already higher than the input latency of our slowest micro controller.
  • Visually deciding whether a key switch is pressed or not pressed, at frame-perfect precision, seems impossible to me.

I believe the app can work, provided the latency you want to measure is really high. But with the devices covered in this article, the app couldn’t measure even 10ms of injected input latency.

Appendix B: USB polling interval (device side)

You can also verify the USB polling interval on the device side. In the SOF (Start Of Frame) interrupt in tmk_core/protocol/chibios/usb_main.c, we can print the cycle delta to the previous SOF callback, every second:

#include "timer.h"

static uint32_t last_sof = 0;
static uint32_t sof_timer = 0;
void kbd_sof_cb(USBDriver *usbp) {
  (void)usbp;

  uint32_t now = DWT->CYCCNT;
  uint32_t delta = now - last_sof;
  last_sof = now;

  uint32_t timer_now = timer_read32();
  if (TIMER_DIFF_32(timer_now, sof_timer) > 1000) {
    sof_timer = timer_now;
    dprintf("sof delta: %u cycles", delta);
  }
}

Using hid_listen, we expect to see ≈75000 cycles of delta, which corresponds to the 125μs microframe interval of USB 2.0 High Speed with bInterval=1 in the USB endpoint descriptor:

125μs * 1000 ns/μs * 0.6 cycles/ns = 75000 cycles
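
When reading the hid_listen output, it can be handy to go the other way and turn the printed cycle delta back into microseconds. A minimal sketch, assuming a hypothetical helper name cycles_to_us and a core clock that is a whole number of MHz:

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: converts a DWT cycle delta into microseconds
// for a core running at cpu_hz.
static inline uint32_t cycles_to_us(uint32_t cycles, uint32_t cpu_hz) {
  assert(cpu_hz >= 1000000u);  // this sketch assumes a clock of at least 1 MHz
  // cpu_hz / 1000000 is the number of cycles per microsecond.
  return cycles / (cpu_hz / 1000000u);
}
```

At 600 MHz, a delta of 75000 cycles converts to 125μs, while a Full Speed device polled every 1ms would show a delta of roughly 600000 cycles instead.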

at 2021-05-08 13:57

2021-05-07

michael-herbst.com

DFTK: A Julian approach for simulating electrons in solids

Following my talk at JuliaCon about our DFTK code last year (slides, recording, blog article), we have now published an extended abstract in the JuliaCon proceedings, which you can find below. The JuliaCon proceedings use the same open journals software stack to manage their publication infrastructure as the Journal of Open Source Software. This stack is actually pretty impressive, since it reduces the effort for both reviewers and authors to comments within GitHub issues. Because the complete exchange (including the review process) is therefore public, this is not only convenient, but also leads to a truly transparent publication process. I wish publishing with all journals was like that ...

by Michael F. Herbst at 2021-05-07 22:30 under Publications, talk, electronic structure theory, Julia, HPC, DFTK, theoretical chemistry, SCF, high-throughput

2021-05-01

michael-herbst.com

Thoughts on initial guess methods for DFT

On Thursday I gave a brief talk in our weekly ACED differentiate group meeting about initial guess methods for starting self-consistent field calculations in methods such as density-functional theory. To prepare the talk, I did a little digging into the standard approaches used by many molecular and solid-state codes, and reviewed the literature on some recent ideas motivated by reduced-order modelling or data science. The slides of my talk (which include most references I found) are attached below.

Link
Thoughts on initial guess methods for DFT (Slides)

by Michael F. Herbst at 2021-05-01 10:00 under Research, talk, electronic structure theory, Kohn-Sham, high-throughput, DFT, solid state

2021-04-30

RaumZeitLabor

Six-month anniversary at the RaumZweitLabor: construction love in the time of Corona

At the new RZL, we are currently still celebrating every weekly and monthly anniversary like a teenage couple in love. Unbelievable that our “six-month anniversary” is already coming up. <3

To let you share in our happiness, here is a small update on all the exciting things that have happened recently, in compliance with the Corona regulations:

The first time unpacking boxes, converting, dividing up and furnishing new rooms; the first time disposing of 5 cubic meters of man-made mineral fibers (KMF) and other legacy waste; the first time a broken passenger elevator, water in the freight elevator shaft in the basement and communication problems with the landlord; the first time submitting invoices to the insurance company; the first time changing our address everywhere; the first time a fire prevention inspection by the fire department in the new space; the first time a winter with working heating; the first time putting nice, colorful decoration on the new walls; the first time building lab tables ourselves; and much, much more…

We have also become part of the Rat für Kunst und Kultur Mannheim, in its section Kulturelle Bildung und Soziokultur (cultural education and socioculture), are working in the background on exciting projects for the post-Corona era, and have finally had the new rooms certified with the “Inte approved” seal!

So that this fresh relationship does not turn into a quickly faded romance, we continue to depend on your participation in (construction) activities, and are happier than ever about one-time and lasting tokens of affection.

Halbjahr21-Collage

by flederrattie at 2021-04-30 00:00

2021-04-27

sECuREs website

Linux and USB virtual serial devices (CDC ACM)

During my work on Teensy 4.1 support in ChibiOS for the QMK keyboard firmware, I noticed that ChibiOS’s virtual serial device USB demo would sometimes print garbled output, and that I would never see the ChibiOS shell prompt.

This article walks you through diagnosing and working around this issue, in the hope that it helps others who are working with micro controllers and USB virtual serial devices.

Background

Serial interfaces are often the easiest option when working with micro controllers to print text: you only connect GND and the micro controller’s serial TX pin to a USB-to-serial converter. The RX pin is only needed when you want to send text to the micro controller as well.

While conceptually simple, the requirement for an extra piece of hardware (USB-to-serial adapter) is annoying. If your micro controller has a working USB interface and USB stack, a popular alternative is for the micro controller to provide a virtual serial device via USB.

This way, you just need one USB cable between your micro controller and computer, reusing the same connection you already use for programming the device.

A popular choice within this solution is to provide a device conforming to the USB Communications Device Class (CDC) standard, specifically its Abstract Control Model (ACM), which is typically used for modem hardware.

On Linux, these devices show up as e.g. /dev/ttyACM0. In case you’re wondering: /dev/ttyUSB0 device names are used by more specific drivers (vendor-specific). The blog post What is the difference between /dev/ttyUSB and /dev/ttyACM? goes into a lot more detail.

ModemManager

One unfortunate side-effect of using a modem standard to provide a generic serial device is that modem-related software might mistake our micro controller for a modem.

ModemManager might otherwise open and probe any new serial device. Use the following command to disable it until the next reboot:

% sudo systemctl mask --runtime --now ModemManager

Problem statement

With a regular, non-USB serial interface, you can send data at any time. If nobody is receiving the data on the other end, the micro controller doesn’t care and still writes serial data.

When using the ChibiOS shell with a regular serial interface, this means that if you open the serial interface too late, you will not see the ChibiOS shell prompt. But, if you have the serial interface already opened when powering on your device, you will be greeted by ChibiOS’s shell prompt:

ChibiOS/RT Shell
ch> 

With a USB serial, however, the host will not transfer data from the device until the serial interface is opened. This means that writes to the USB serial can block, whereas writes to the UART serial will not block but may go ignored if nobody is listening.

So when I open the USB serial interface, I would expect to see the ChibiOS shell prompt like above. Instead, I would often not see any prompt at all, and I would even sometimes see garbled output like this:

cch> biOS/RT She

USB analysis with Wireshark

Wireshark allows us to analyze USB traffic in combination with the usbmon Linux kernel module.

Looking through the captured packets, I noticed unexpected packets from the host (computer) to the device (micro controller), specifically containing the following bytes:

  1. hex 0xa = ASCII \n
  2. hex 0xd = ASCII \r

Seeing any packets in this direction is unexpected, because I am only opening the serial interface for reading, and I am not consciously sending anything. So where do the packets come from?

To verify I am not missing any nuance of the CDC protocol, I added debug statements to the ChibiOS shell to log any incoming data. The \n\r bytes indeed make it to the ChibiOS shell.

When the shell receives a line break, it prints a new prompt. This seems to be the reason why I’m seeing garbled data: while the output is transferred to the host, line breaks are received, causing more data transfers. It’s as if somebody was hammering the return key really quickly.

Linux tty echo vs. ChibiOS shell banner

The unexpected \n\r bytes turn out to come from the Linux USB CDC ACM driver, or its interplay with the Linux tty driver, to be specific. The CDC ACM driver is a kind of tty driver, so it is built atop the Linux tty infrastructure, whose standard settings include various ECHO flags.

When echoing is enabled, the ChibiOS shell banner triggers echo characters, which in turn are interpreted as input to the shell, causing garbled output.

So why is echoing enabled? Wouldn’t a terminal emulator turn off echoing first thing?

Yes. But, when the CDC ACM driver receives the first data transfer via USB (already queued), the standard tty settings are still in effect, because the application did not yet have a chance to set its tty configuration up!

This can be verified by running the following command on a Linux host:

% stty -F /dev/ttyACM0 115200 -echo -echoe -echok

Even though the command’s sole purpose is to configure the tty, its opening of the device still causes the banner to print, and echoing to happen, and garbled output is the result.
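
The same ordering problem exists for any program that configures the tty itself: the termios calls need a file descriptor, and obtaining one means the device has already been opened. The sketch below shows the usual POSIX sequence; the function names are my own, and it mirrors the stty flags above rather than any particular terminal emulator.

```c
#include <assert.h>
#include <stddef.h>
#include <termios.h>

// Clears the echo-related local flags, like `stty -echo -echoe -echok`.
static inline void clear_echo_flags(struct termios *t) {
  assert(t != NULL);
  t->c_lflag &= ~(tcflag_t)(ECHO | ECHOE | ECHOK);
}

// Applies the change to an already-opened tty.
// Returns 0 on success and -1 on error, like tcsetattr(3).
static inline int tty_disable_echo(int fd) {
  struct termios t;
  if (tcgetattr(fd, &t) != 0) {
    return -1;
  }
  clear_echo_flags(&t);
  return tcsetattr(fd, TCSANOW, &t);
}
```

By the time tty_disable_echo() can run, open(2) has already returned, so a banner that the device sent right away has been processed with the default settings.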

It turns out this is a somewhat common problem. Hence, the Linux USB CDC ACM driver has a quirks table, in which devices that print a banner select the DISABLE_ECHO quirk, which results in the CDC ACM driver turning off the echoing termios flag early:

static const struct usb_device_id acm_ids[] = {
	/* quirky and broken devices */
	{ USB_DEVICE(0x0424, 0x274e), /* Microchip Technology, Inc. */
	  .driver_info = DISABLE_ECHO, }, /* DISABLE ECHO in termios flag */
// …

So, a quick solution to turn off echoing early is to change your USB vendor and product id (VID/PID) to an ID for which the Linux kernel applies the DISABLE_ECHO quirk, e.g.:

#define USB_DEVICE_VID 0x0424
#define USB_DEVICE_PID 0x274e

Flushing in Screen

With tty echo disabled, I don’t see garbled output anymore, but still wouldn’t always see the ChibiOS shell prompt!

This issue turned out to be specific to the terminal emulator program I’m using. For many years, I have been using Screen for serial devices of any sort.

I was surprised to learn during this investigation that Screen flushes any pending output when opening the device. This typically isn’t a problem because adapter-backed serial devices are opened once and then stay open. USB virtual serial devices however are only opened when used, and disappear when loading new program code onto your micro controller.

I verified this is the problem by using cat(1) instead, with which I can indeed see the prompt:

% cat /dev/ttyACM0

ChibiOS/RT Shell
                
                ch> 

After commenting out the flush call in Screen’s sources, I could see the prompt in Screen as well.

Line ending conversion

Now that we no longer flush the prompt away, why is the spacing still incorrect, and where does it go wrong?


ChibiOS/RT Shell
                
                ch> 

If we use strace(1) to see what screen(1) or cat(1) read from the driver, we see:

797270 read(7, "\n\nChibiOS/RT Shell\n\nch> ", 4096) = 24

We would have expected "\r\nChibiOS/RT Shell\r\nch> " instead, meaning all Carriage Returns (\r) have been translated to Newlines (\n).

This is again due to the Linux tty driver’s default termios settings: c_iflag enables option ICRNL by default, which translates CR (Carriage Return) to NL (Newline).
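
The effect of ICRNL on the bytes seen by the application can be reproduced in a few lines. This is only a simulation of the translation for illustration, with a made-up function name, not code from the kernel:

```c
#include <assert.h>
#include <stddef.h>

// Simulates what ICRNL does to received input: every Carriage Return
// is replaced by a Newline, the number of bytes stays the same.
static inline void icrnl_translate(char *buf, size_t len) {
  assert(buf != NULL || len == 0);
  for (size_t i = 0; i < len; i++) {
    if (buf[i] == '\r') {
      buf[i] = '\n';
    }
  }
}
```

Applied to the 24 bytes "\r\nChibiOS/RT Shell\r\nch> " sent by the device, this yields exactly the "\n\nChibiOS/RT Shell\n\nch> " from the strace output: the cursor moves down twice per line ending but never returns to the first column.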

Unfortunately, contrary to the DISABLE_ECHO quirk, there is no corresponding quirk in the Linux ACM driver to turn off line ending conversion, so a fix would need a Linux kernel driver change!

Device-side workaround: wait until opened

At this point, we have covered a few problems that would need to be fixed:

  1. Change USB VID/PID to get the DISABLE_ECHO quirk in the driver.
  2. Recompile terminal emulator programs to remove flushing, if needed.
  3. Modify kernel driver to add quirk to disable Carriage Return (\r) conversion.

Time for a quick reality check: fixing all parts of the stack seems too hard and would take too long. Is there an easier way, and why don’t others run into this problem? If only the device didn’t print its banner so early, that would circumvent all of the problems above, too!

Luckily, the host actually notifies the device when a terminal emulator program opens the USB serial device by sending a CDC_SET_CONTROL_LINE_STATE request. I verified this behavior on Linux, Windows and macOS.

So, let’s implement a workaround in our device code! We will delay starting the shell until:

  1. The USB serial device was opened (not just configured).
  2. An additional delay of 100ms has passed to give the terminal emulator application a chance to configure the serial device.

In our main.c loop, we wait until USB is active, and until we receive the first CDC_SET_CONTROL_LINE_STATE request because the serial port was opened:

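  while (true) {
    if (SDU1.config->usbp->state == USB_ACTIVE) {
      // Block until requests_hook() signals that the host opened the port.
      chSemWait(&scls);
      // Give the terminal emulator time to configure the tty.
      chThdSleepMilliseconds(100);

      thread_t *shelltp = chThdCreateFromHeap(NULL, SHELL_WA_SIZE, "shell",
                                              NORMALPRIO + 1, shellThread,
                                              (void *)&shell_cfg1);
      // Wait for the shell to terminate before offering a new one.
      chThdWait(shelltp);
    }
  }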

And in our usbcfg.c, when receiving a CDC_SET_CONTROL_LINE_STATE request, we will reset the semaphore to non-blockingly wake up all waiters:

extern semaphore_t scls;

bool requests_hook(USBDriver *usbp) {
  const bool result = sduRequestsHook(usbp);

  if ((usbp->setup[0] & USB_RTYPE_TYPE_MASK) == USB_RTYPE_TYPE_CLASS &&
      usbp->setup[1] == CDC_SET_CONTROL_LINE_STATE) {
    osalSysLockFromISR();
    chSemResetI(&scls, 0);
    osalSysUnlockFromISR();
  }

  return result;
}
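
A possible refinement, which I have not tested on hardware, is to also look at the request's wValue field. In the CDC specification, SET_CONTROL_LINE_STATE has bRequest 0x22, and bit 0 of wValue carries DTR. The sketch below decodes a raw 8-byte setup packet; the macro and function names are hypothetical, and whether a given host clears DTR when the port is closed is an assumption worth verifying with Wireshark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RTYPE_TYPE_MASK 0x60u             // bmRequestType bits 6..5
#define RTYPE_TYPE_CLASS 0x20u            // class-specific request
#define REQ_SET_CONTROL_LINE_STATE 0x22u  // CDC bRequest value
#define LINE_STATE_DTR 0x0001u            // wValue bit 0: DTR

// Returns true if the setup packet is a CDC SET_CONTROL_LINE_STATE request.
static inline bool is_set_control_line_state(const uint8_t setup[8]) {
  assert(setup != NULL);
  return (setup[0] & RTYPE_TYPE_MASK) == RTYPE_TYPE_CLASS &&
         setup[1] == REQ_SET_CONTROL_LINE_STATE;
}

// Returns true only if the request asserts DTR, which hosts commonly do
// when a program opens the port (assumption, see above).
static inline bool port_opened(const uint8_t setup[8]) {
  if (!is_set_control_line_state(setup)) {
    return false;
  }
  // wValue is little-endian in bytes 2 and 3 of the setup packet.
  const uint16_t w_value = (uint16_t)(setup[2] | (setup[3] << 8));
  return (w_value & LINE_STATE_DTR) != 0;
}
```

With such a check, the semaphore would only be reset when the port is opened, not when the line state changes for other reasons.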

Screenshots: Mac and Windows

Aside from Linux, I also verified the workaround works on a Mac (with Screen):

USB virtual serial device on macOS

…and that it works on Windows (with PuTTY):

USB virtual serial device on Windows 10

at 2021-04-27 06:18

2021-04-09

michael-herbst.com

A novel black-box preconditioning strategy for high-throughput density-functional theory

A couple of weeks ago, from 15th to 19th March, I participated in the virtual annual meeting of the German Association of Applied Mathematics and Mechanics (GAMM). For me, this meeting was the first time I presented my work to an audience of applied mathematicians with a broad background and no inherent interest in quantum chemistry. With only 15 minutes for my talk in the "scientific computing" track, preparing the material was quite a challenge. I hope I still managed to convey the main ideas of our recently published LDOS preconditioner in a broadly accessible way. My slides are attached below.

Link
A novel black-box preconditioning strategy for high-throughput density-functional theory (Slides)

by Michael F. Herbst at 2021-04-09 16:00 under Research, talk, electronic structure theory, Julia, DFTK, theoretical chemistry, numerical analysis, Kohn-Sham, high-throughput, DFT, solid state

2021-04-02

sECuREs website

Emacs: overriding the project.el project directory

I recently learnt about the Emacs package project.el, which is used to figure out which files and directories belong to the same project. This is used under the covers by Eglot, for example.

In practice, a project is recognized by looking for Git repositories, which is a decent first approximation that often just works.

But what if the detection fails? For example, maybe you want to anchor your project-based commands in a parent directory that contains multiple Git repositories.

Luckily, we can provide our own entry to the project-find-functions hook, and look for a .project.el file in the parent directories:

;; Returns the parent directory containing a .project.el file, if any,
;; to override the standard project.el detection logic when needed.
(defun zkj-project-override (dir)
  (let ((override (locate-dominating-file dir ".project.el")))
    (if override
      (cons 'vc override)
      nil)))

(use-package project
  ;; Cannot use :hook because 'project-find-functions does not end in -hook
  ;; Cannot use :init (must use :config) because otherwise
  ;; project-find-functions is not yet initialized.
  :config
  (add-hook 'project-find-functions #'zkj-project-override))

Now, we can use touch .project.el in any directory to make project.el recognize the directory as the project root!

By the way, in case you are unfamiliar, the configuration above uses use-package, which is a great way to (lazily, i.e. quickly!) load and configure Emacs packages.

at 2021-04-02 12:08