December 02, 2021


Paul Tagliamonte

Processing IQ data formats (Part 1/5) 🐀

🐀 This post is part of a series called "PACKRAT". If this is the first post you've found, it'd be worth reading the intro post first and then looking over all posts in the series.

When working with SDRs, information about the signals your radio is receiving is communicated by streams of IQ data. IQ is short for “In-phase” and “Quadrature”, the quadrature component being 90 degrees out of phase with the in-phase component. Values in the IQ stream are commonly treated as complex numbers because doing so helps greatly when processing the IQ data for meaning.

I won’t get too deep into what IQ is or why we use complex numbers (mostly since I don’t think I fully understand it well enough to explain it yet), but here are some basics in case this is your first interaction with IQ data, before you go off and read more.

Before we get started — at any point, if you feel lost in this post, it's OK to take a break to do a bit of learning elsewhere on the internet. I'm still new to this, so I'm sure my one-paragraph overview here won't help clarify things too much. This took me months to sort out on my own. It's not you, really! I particularly enjoyed reading visual-dsp.switchb.org when it came to learning about how IQ represents signals, and Software-Defined Radio for Engineers for a more general reference.

Each value in the stream is taken at a precisely spaced sampling interval (the reciprocal of the sampling rate of the radio). Jitter in that sampling interval, or a drift between the requested and actual sampling rates (usually represented in PPM, or parts per million – roughly, how many samples out of one million are missing) can cause errors in frequency. In the case of a PPM error, one radio may think it’s at 100.1MHz and the other may think it’s at 100.2MHz, and jitter will result in added noise in the resulting stream.
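To make the PPM number concrete, here's a small self-contained Go sketch (my own illustration, not from the post or any SDR library; the function name is made up) showing how a clock error in PPM turns into a frequency error at a given center frequency:

package main

import "fmt"

// freqErrorHz is a hypothetical helper: the frequency error, in Hz, of a
// radio whose reference clock is off by `ppm` parts per million while
// tuned to centerHz.
func freqErrorHz(centerHz, ppm float64) float64 {
	return centerHz * ppm / 1e6
}

func main() {
	// A 25 PPM error at 100MHz leaves the radio 2.5kHz away from where
	// it thinks it is tuned.
	fmt.Println(freqErrorHz(100e6, 25)) // 2500
}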

A single IQ sample is both the real and imaginary values, together. The complex number (both parts) is the sample. The number of samples per second is the number of real and imaginary value pairs per second.

Each sample is a reading of the electrical energy coming off the antenna at that exact instant. We’re looking to see how that goes up and down over time to determine what frequencies we’re observing around us. If the IQ stream contains only real-valued measurements (e.g., plain float values read from a voltage on a wire rather than complex values), you can still send and receive signals, but those signals will be mirrored across your 0Hz boundary. That means if you’re tuned to 100MHz and you have a nearby transmitter at 99.9MHz, you’d also see it at 100.1MHz. If you want to get an intuitive understanding of this concept before getting into the heavy math, a good place to start is looking at how Quadrature encoders work. Using complex numbers means we can see “up” in frequency as well as “down” in frequency, and understand that those are different signals.

The reason we need negative frequencies is that our 0Hz is the center of our SDR’s tuned frequency, not actually 0Hz in nature. Generally speaking, the SDR is doing loads of work in hardware (and firmware!) to mix the raw RF signal with a local oscillator down to a frequency that can be sampled at the requested rate (fundamentally the same concept as a superheterodyne receiver), so a frequency of ‘-10MHz’ means that signal is 10MHz below the center of our SDR’s tuned frequency.

The sampling rate dictates the amount of frequency representable in the data stream. You’ll sometimes see this called the Nyquist frequency. The Nyquist frequency is one half of the sampling rate. Intuitively, if you think about the amount of bandwidth observable as being 1:1 with the sampling rate of the stream, and the middle of your bandwidth is 0Hz, you only have enough space to go up in frequency for half of your bandwidth – or half of your sampling rate. The same goes for going down in frequency. For example, a complex stream sampled at 1,024 samples per second can represent signals from 512Hz below to 512Hz above the tuned center frequency.

Float 32 / Complex 64

IQ samples being processed in software are commonly represented as interleaved pairs of 32 bit floating point numbers, or equivalently as 64 bit complex numbers. The first float32 of each pair is the real value, and the second is the imaginary value.

I#0
Q#0
I#1
Q#1
I#2
Q#2

The complex number 1+1i is represented as 1.0 1.0 and the complex number -1-1i is represented as -1.0 -1.0. Unless otherwise specified, all the IQ samples and pseudocode to follow assume interleaved float32 IQ data streams.

Example interleaved float32 file (10Hz Wave at 1024 Samples per Second)
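If it helps to see that more concretely, here's a minimal Go sketch (mine, not part of the post's tooling; the filename is a placeholder) that loads an interleaved float32 capture like the example file above into complex64 samples, assuming the usual little-endian byte order:

package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"os"
)

func main() {
	// "capture.cf32" is a placeholder name for an interleaved float32 IQ file.
	buf, err := os.ReadFile("capture.cf32")
	if err != nil {
		panic(err)
	}

	// Each complex sample is 8 bytes: a float32 I value followed by a float32 Q value.
	samples := make([]complex64, 0, len(buf)/8)
	for i := 0; i+8 <= len(buf); i += 8 {
		re := math.Float32frombits(binary.LittleEndian.Uint32(buf[i : i+4]))
		im := math.Float32frombits(binary.LittleEndian.Uint32(buf[i+4 : i+8]))
		samples = append(samples, complex(re, im))
	}
	fmt.Println("read", len(samples), "IQ samples")
}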

RTL-SDR

IQ samples from the RTL-SDR are encoded as a stream of interleaved unsigned 8 bit integers (uint8 or u8). The first value is the real (in-phase or I) part, and the second is the imaginary (quadrature or Q) part. Together each pair of values makes up a complex number at a specific time instant.

I#0
Q#0
I#1
Q#1
I#2
Q#2

The complex number 1+1i is represented as 0xFF 0xFF and the complex number -1-1i is represented as 0x00 0x00. The complex number 0+0i is not easily representable – since half of 0xFF is 127.5.

Complex Number    Representation
1+1i              []uint8{0xFF, 0xFF}
-1+1i             []uint8{0x00, 0xFF}
-1-1i             []uint8{0x00, 0x00}
0+0i              []uint8{0x80, 0x80} or []uint8{0x7F, 0x7F}

And finally, here’s some pseudocode to convert an rtl-sdr style IQ sample to a floating point complex number:

...
in = []uint8{0x7F, 0x7F}
// scale the 0x00..0xFF range to -1..+1, centered on 127.5
real = (float(in[0]) - 127.5) / 127.5
imag = (float(in[1]) - 127.5) / 127.5
out = complex(real, imag)
...

Example interleaved uint8 file (10Hz Wave at 1024 Samples per Second)

HackRF

IQ samples from the HackRF are encoded as a stream of interleaved signed 8 bit integers (int8 or i8). The first value is the real (in-phase or I) part, and the second is the imaginary (quadrature or Q) part. Together each pair of values makes up a complex number at a specific time instant.

I#0
Q#0
I#1
Q#1
I#2
Q#2

Formats that use signed integers do have one quirk due to two’s complement: the absolute value of the smallest representable negative number is one more than the largest positive number. int8 values can range from -128 to 127, which means there’s a bit of ambiguity in how +1, 0 and -1 are represented. You can either create a perfectly symmetric range of values between +1 and -1 (in which case 0 is not representable), allow more possible values in the negative range than in the positive, or allow values slightly outside the nominal -1 to +1 range.

Within my implementation, my approach has been to scale based on the max positive integer value of the type, so the lowest possible signed value is actually slightly smaller than -1. Generally, if your code is seeing values that low, the difference between -1 and slightly less than -1 isn’t very significant, even with only 8 bits. It’s just a curiosity to be aware of.

Complex Number    Representation
1+1i              []int8{127, 127}
-1+1i             []int8{-128, 127}
-1-1i             []int8{-128, -128}
0+0i              []int8{0, 0}

And finally, here’s some pseudocode to convert a hackrf style IQ sample to a floating point complex number:

...
in = []int8{-5, 112}
// scale by the max positive int8 value; -128 maps to slightly below -1
real = float32(in[0]) / 127
imag = float32(in[1]) / 127
out = complex(real, imag)
...

Example interleaved int8 file (10Hz Wave at 1024 Samples per Second)

PlutoSDR

IQ samples from the PlutoSDR are encoded as a stream of interleaved signed 16 bit integers (int16 or i16). The first value is the real (in-phase or I) part, and the second is the imaginary (quadrature or Q) part. Together each pair of values makes up a complex number at a specific time instant.

Almost no SDRs capture at a 16 bit depth natively; often you’ll see 12 bit integers (as is the case with the PlutoSDR) being sent around as 16 bit integers. This leads to the next question: are the values LSB or MSB aligned? The PlutoSDR sends data LSB aligned (which is to say, the largest real or imaginary value in the stream will not exceed 4095), but expects data being transmitted to be MSB aligned (which is to say the lowest bit that can be set is the 5th bit in the number, or values can only change in increments of 16).

As a result, the quirk observed with the HackRF (that the range of values between 0 and -1 is different than the range of values between 0 and +1) does not impact us so long as we do not use the whole 16 bit range.

Complex Number    Representation
1+1i              []int16{32767, 32767}
-1+1i             []int16{-32768, 32767}
-1-1i             []int16{-32768, -32768}
0+0i              []int16{0, 0}

And finally, here’s some pseudocode to convert a PlutoSDR style IQ sample to a floating point complex number, including moving the sample from LSB to MSB aligned:

...
in = []int16{-15072, 496}
// shift left 4 bits (16 bits - 12 bits = 4 bits)
// to move from LSB aligned to MSB aligned.
in[0] = in[0] << 4
in[1] = in[1] << 4
real = float32(in[0]) / 32767
imag = float32(in[1]) / 32767
out = complex(real, imag)
...

Example interleaved i16 file (10Hz Wave at 1024 Samples per Second)

Next Steps

Now that we can read (and write!) IQ data, we can get started first on the transmitter, which we can (in turn) use to test receiving our own BPSK signal, coming next in Part 2!

02 December, 2021 05:00PM

Intro to PACKRAT (Part 0/5) 🐀

Hello! Welcome. I’m so thrilled you’re here.

Some of you may know this (as I’ve written about in the past), but if you’re new to my RF travels, I’ve spent nights and weekends over the last two years doing some self-directed learning on how radios work. I’ve gone from a very basic understanding of wireless communications all the way through the process of learning about and implementing a set of libraries to modulate and demodulate data using my now formidable stash of SDRs. I’ve been implementing all of the RF processing code from first principles, purely based on other primitives I’ve written myself, to prove to myself that I understand each concept before moving on.

I’ve just finished a large personal milestone – I was able to successfully send a cURL HTTP request through a network interface into my stack of libraries, through my own BPSK implementation, framed in my own artisanal hand crafted Layer 2 framing scheme, demodulated by my code on the other end, and sent into a Linux network interface. The combination of the Layer 1 PHY and Layer 2 Data Link is something that I’ve been calling “PACKRAT”.

$ curl http://44.127.0.8:8000/
* Connected to 44.127.0.8 (44.127.0.8) port 8000 (#0)
> GET / HTTP/1.1
> Host: localhost:1313
> User-Agent: curl/7.79.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
* HTTP/1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Length: 236
<
 ____   _    ____ _  ______      _  _____
|  _ \ / \  / ___| |/ /  _ \    / \|_   _|
| |_) / _ \| |   | ' /| |_) |  / _ \ | |
|  __/ ___ \ |___| . \|  _ <  / ___ \| |
|_| /_/   \_\____|_|\_\_| \_\/_/   \_\_|
* Closing connection 0

In an effort to “pay it forward”, and to thank my friends for their time walking me through huge chunks of this as well as those who publish their work, I’m now spending some time documenting how I was able to implement this protocol. I would never have gotten as far as I did without the incredible patience and kindness of friends spending time working with me, and of educators publishing their hard work for the world to learn from. Please accept my deepest thanks and appreciation.

The PACKRAT posts are written from the perspective of a novice radio engineer but experienced software engineer. I’ll be leaving out a lot of the technical details on the software end and the specific software implementation, focusing exclusively on the general gist of the implementation of the radio-critical components. This is intended to be a framework – a jumping-off point – for those who are interested in doing this themselves. I hope that this series of blog posts will come to be useful to those who embark on this incredibly rewarding journey after me.

This is the first post in the series, and it will contain links to all the posts to follow. This is going to be the landing page I link others to – as I publish additional posts, I’ll be updating the links on this page. The posts will also grow a tag, which you can check back on, or follow along with here.

Tau

Tau (𝜏) is a much more natural expression of the mathematical constant used for circles, and I use it rather than Pi (π). You may see me use Tau in code or text – Tau is the same as 2π, so if you see a Tau and don’t know what to do, feel free to mentally or textually replace it with 2π. I just hate always writing 2π everywhere – and only using π (or worse yet, 2π/2) when I mean 1/2 of a circle (or, 𝜏/2).
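For what it's worth, in Go-style code that's nothing more than a constant (a trivial sketch, just to show there's no magic here; the package name is arbitrary):

package packrat // arbitrary package name for this sketch

import "math"

// Tau is one full turn around a circle; Tau/2 is half a circle, Tau/4 a quarter turn.
const Tau = 2 * math.Pi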

Pseudo-code

Basically none of the code contained in this series is valid on its own. It’s very loosely Go, and only meant to express concepts in terms of software. The examples in these posts shouldn’t be taken on their own as working snippets to process IQ data, but rather used to guide implementations that process the data in question. I’d love to invite all readers to “play at home” with the examples, and try to work through the example data captures!

Captures

Speaking of captures, I’ve included live on-the-air captures of PACKRAT packets, as transmitted from my implementation, in different parts of these posts. This means you can go through the process of building code to parse and receive PACKRAT packets, and then build a transmitter that is validated by your receiver. It’s my hope folks will follow along at home and experiment with software to process RF data on their own!

Posts in this series

02 December, 2021 04:00PM


Steve Kemp

It has been some time..

I realize it has been quite some time since I last made a blog-post, so I guess the short version is "I'm still alive", or as Granny Weatherwax would have said:

I ATE'NT DEAD

Of course if I die now this would be an awkward post!

I can't think of anything terribly interesting I've been doing recently, mostly being settled in my new flat and tinkering away with things. The latest "new" code was something for controlling mpd via a web-browser:

This is a simple HTTP server which allows you to minimally control mpd running on localhost:6600. (By minimally I mean literally "stop", "play", "next track", and "previous track").
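To give a sense of how minimal that control can be, here is a rough Go sketch of the raw MPD protocol (my own illustration, not the code behind this project): the client opens a TCP connection to localhost:6600, reads the greeting, and sends single-word commands such as "play", "stop", "next" or "previous":

package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
)

// mpdCommand sends one MPD protocol command and returns the server's reply.
func mpdCommand(addr, command string) (string, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()

	r := bufio.NewReader(conn)
	if _, err := r.ReadString('\n'); err != nil { // greeting: "OK MPD <version>"
		return "", err
	}
	if _, err := fmt.Fprintf(conn, "%s\n", command); err != nil {
		return "", err
	}
	return r.ReadString('\n') // "OK" on success, "ACK ..." on error
}

func main() {
	reply, err := mpdCommand("localhost:6600", os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(reply)
}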

I have all my music stored on my desktop, and I use mpd to play it locally through a pair of speakers plugged into that computer. Sometimes I want music in the sauna, or in the bedroom. So I have a couple of bluetooth speakers which are used to send the output to another room. When I want to skip tracks I just open the mpd-web site on my phone and tap the button. (I did look at Android mpd-clients, but it seemed like installing an application for this was a bit of overkill.)

I guess I've not been doing so much "computer stuff" outside work for a year or so. I guess lack of time, lack of enthusiasm/motivation.

So looking forward to things? I'll be in the UK for a while over Christmas, barring surprises. That should be nice as I'll get to see family, take our child to visit his grandparents (on his birthday no less) and enjoy playing the "How many Finnish people can I spot in the UK?" game.

02 December, 2021 03:00PM


Dirk Eddelbuettel

drat 0.2.2 on CRAN: Package Maintenance


A fresh and new minor release of drat arrived on CRAN overnight. This is another small update relative to the 0.2.0 release in April followed by a 0.2.1 update in July. This release follows the changes made in digest yesterday. We removed the YAML file (and badge) for the disgraced former continuous integration service we shall not name (yet that we all used to use). And we converted the vignette from using the minidown package to the (fairly new) simplermarkdown package which is so much more appropriate for our use of the minimal water.css style.

drat stands for drat R Archive Template, and helps with easy-to-create and easy-to-use repositories for R packages. Since its inception in early 2015 it has found reasonably widespread adoption among R users because repositories with marked releases are the better way to distribute code. See below for a few custom reference examples.

Because for once it really is as your mother told you: Friends don’t let friends install random git commit snapshots. Properly rolled-up releases it is. Just as CRAN shows us: a model that has demonstrated for two-plus decades how to do this. And you can too: drat is easy to use, documented by six vignettes, and it just works.

Detailed information about drat is at its documentation site.

The NEWS file summarises the release as follows:

Changes in drat version 0.2.2 (2021-12-01)

  • Travis artifacts and badges have been pruned

  • Vignettes now use simplermarkdown

Courtesy of my CRANberries, there is a comparison to the previous release. More detailed information is on the drat page as well as at the documentation site.

If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

02 December, 2021 01:33PM

December 01, 2021


Junichi Uekawa

December.

December. The world is turbulent and I am still worried where we are going.

01 December, 2021 11:56PM by Junichi Uekawa

Thorsten Alteholz

My Debian Activities in November 2021

FTP master

This month I accepted 564 and rejected 93 packages. The overall number of packages that got accepted was 591.

Debian LTS

This was my eighty-ninth month of doing some work for the Debian LTS initiative, started by Raphael Hertzog at Freexian.

This month my overall workload was 40h. During that time I did LTS and normal security uploads of:

  • [DLA 2820-1] atftp security update for two CVEs
  • [DLA 2821-1] axis security update for one CVE
  • [DLA 2822-1] netkit-rsh security update for two CVEs
  • [DLA 2825-1] libmodbus security update for two CVEs
  • [#1000408] for libmodbus in Buster
  • [#1000485] for btrbk in Bullseye
  • [#1000486] for btrbk in Buster

I also started to work on pgbouncer to get an update for each release and had to process packages from NEW on security-master.

Further, I worked on a script to automatically publish to the Debian website the DLAs that are posted to debian-lts-announce. The script can be found on salsa. It only publishes announcements from people on a whitelist. At the moment it is running on a computer at home. You might run your own copy, or just send me an email to be put on the whitelist as well.

Last but not least I did some days of frontdesk duties.

Debian ELTS

This month was the forty-first ELTS month.

During my allocated time I uploaded:

  • ELA-517-1 for atftp
  • ELA-519-1 for qtbase-opensource-src
  • ELA-520-1 for libsdl1.2
  • ELA-521-1 for libmodbus

Last but not least I did some days of frontdesk duties.

Debian Printing

Unfortunately I did not do as much as I wanted this month. At least I looked at some old bugs and uploaded new upstream versions of …

I hope this will improve in December again. New versions of cups and hplip are on my TODO-list.

Debian Astro

This month I uploaded new versions of …

Other stuff

I improved packaging or fixed bugs of:

01 December, 2021 03:33PM by alteholz


Dirk Eddelbuettel

digest 0.6.29 on CRAN: Package Maintenance

Release 0.6.29 of the digest package arrived at CRAN earlier today, and will be uploaded to Debian shortly.

digest creates hash digests of arbitrary R objects (using the md5, sha-1, sha-256, sha-512, crc32, xxhash32, xxhash64, murmur32, spookyhash, and blake3 algorithms) permitting easy comparison of R language objects. It is mature and widely used, as many tasks may involve caching of objects for which it provides convenient general-purpose hash key generation.

This release only contains two smaller internal changes. We removed the YAML file (and badge) for the disgraced former continuous integration service we shall not name (yet that we all used to use). And we converted the vignette from using the minidown package to the (fairly new) simplermarkdown package which is so much more appropriate for our use of the minimal water.css style.

My CRANberries provides the usual summary of changes to the previous version. For questions or comments use the issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

01 December, 2021 12:31PM

Russ Allbery

Review: A World Without Email

Review: A World Without Email, by Cal Newport

Publisher: Portfolio/Penguin
Copyright: 2021
ISBN: 0-525-53657-4
Format: Kindle
Pages: 264

A World Without Email is the latest book by computer science professor and productivity writer Cal Newport. After a detour to comment on the drawbacks of social media in Digital Minimalism, Newport is back to writing about focus and concentration in the vein of Deep Work. This time, though, the topic is workplace structure and collaborative process rather than personal decisions.

This book is a bit hard for me to review because I spoiled myself for the contents by listening to a lot of Newport's podcast, where he covers the same material. I therefore didn't enjoy it as much as I otherwise would have because the ideas were familiar. I recommend the book over the podcast, though; it's tighter, more coherent, and more comprehensive.

The core contention of this book is that knowledge work (roughly, jobs where one spends significant time working on a computer processing information) has stumbled into a superficially tempting but inefficient and psychologically harmful structure that Newport calls the hyperactive hive mind. This way of organizing work is a local maximum: it feels productive, it's flexible and very easy to deploy, and most minor changes away from it make overall productivity worse. However, the incentive structure is all wrong. It prioritizes quick responses and coordination overhead over deep thinking and difficult accomplishments.

The characteristic property of the hyperactive hive mind is free-flowing, unstructured communication between co-workers. If you need something from someone else, you ask them for it and they send it to you. The "email" in the title is not intended literally; Slack and related instant messaging apps are even more deeply entrenched in the hyperactive hive mind than email is. The key property of this workflow is that most collaborative work is done by contacting other people directly via ad hoc, unstructured messages.

Newport's argument is that this workflow has multiple serious problems, not the least of which is that it makes us miserable. If you have read his previous work, you will correctly expect this to tie into his concept of deep work. Ad hoc, unstructured communication creates a constant barrage of unimportant small tasks and interrupts, most of which require several asynchronous exchanges before your brain can stop tracking the task. This creates constant context-shifting, loss of focus and competence, and background stress from ever-growing email inboxes, unread message notifications, and the semi-frantic feeling that you're forgetting something you need to do.

This is not an original observation, of course. Many authors have suggested individual ways to improve this workflow: rules about how often to check one's email, filtering approaches, task managers, and other personal systems. Newport's argument is that none of these individual approaches can address the problem due to social effects. It's all well and good to say that you should unplug from distractions and ignore requests while you concentrate, but everyone else's workflow assumes that their co-workers are responsive to ad hoc requests. Ignoring this social contract makes the job of everyone still stuck in the hyperactive hive mind harder. They won't appreciate that, and your brain will not be able to relax knowing that you're not meeting your colleagues' expectations.

In Newport's analysis, the necessary solution is a comprehensive redesign of how we do knowledge work, akin to the redesign of factory work that came with the assembly line. It's a collective problem that requires a collective solution. In other industries, organizing work for efficiency and quality is central to the job of management, but in knowledge work (for good historical reasons) employees are mostly left to organize their work on their own. That self-organization has produced a system that doesn't require centralized coordination or decisions and provides a lot of superficial flexibility, but which may be significantly inferior to a system designed for how people think and work.

Even if you find this convincing (and I think Newport makes a good case), there are reasons to be suspicious of corporations trying to make people more productive. The assembly line made manufacturing much more efficient, but it also increased the misery of workers so much that Henry Ford had to offer substantial raises to retain workers. As one of Newport's knowledge workers, I'm not enthused about that happening to my job.

Newport recognizes this and tries to address it by drawing a distinction between the workflow (how information moves between workers) and the work itself (how individual workers solve problems in their area of expertise). He argues that companies need to redesign the former, but should leave the latter to each worker. It's a nice idea, and it will probably work in industries like tech with substantial labor bargaining power. I'm more cynical about other industries.

The second half of the book is Newport's specific principles and recommendations for designing better workflows that don't rely on unstructured email. Some of this will be familiar (and underwhelming) to anyone who works in tech; Newport recommends ticket systems and thinks agile, scrum, and kanban are pointed in the right direction. But there are some other good ideas in here, such as embracing specialization.

Newport argues (with some evidence) that the drastic reduction in secretarial jobs, on the grounds that workers with computers can do the same work themselves, was a mistake. Even with new automation, this approach increased the range of tasks required in every other job. Not only was this a drain on the time of other workers, it caused more context switching, which made everyone less efficient and undermined work quality. He argues for reversing that trend: where the work cannot be automated, hire more support workers and more specialized workers in general, stop expecting everyone to be their own generalist admin, and empower support workers to create better systems rather than using the hyperactive hive mind model to answer requests.

There's more here, ranging from specifics of how to develop a structured process for a type of work to the importance of enabling sustained concentration on a task. It's a less immediately actionable book than Newport's previous writing, but I welcome the partial shift in focus to more systemic issues. Newport continues to be relentlessly apolitical, but here it feels less like he's eliding important analysis and more like he thinks the interests of workers and good employers are both served by the approach he's advocating.

I will warn that Newport leans heavily on evolutionary psychology in his argument that the hyperactive hive mind is bad for us. I think he has some good arguments about the anxiety that comes with not responding to requests from others, but I'm not sure intrusive experiments on spectacularly-unusual remnant hunter-gatherer groups, who are treated like experimental animals, are the best way of making that case. I realize this isn't Newport's research, but I think he could have made his point with more directly relevant experiments.

He also continues his obsession with the superiority of in-person conversation over written communication, and while he has a few good arguments, he has a tendency to turn them into sweeping generalizations that are directly contradicted by, well, my entire life. It would be nice if he were more willing to acknowledge that it's possible to express deep emotional nuance and complex social signaling in writing; it simply requires a level of practice and familiarity (and shared vocabulary) that's often missing from the workplace.

I was muttering a lot near the start of this book, but thankfully those sections are short, and I think the rest of his argument sits on a stronger foundation.

I hope Newport continues moving in the direction of more systemic analysis. If you enjoyed Deep Work, you will probably find A World Without Email interesting. If you're new to Newport, this is not a bad place to start, particularly if you have influence on how communication is organized in your workplace. Those who work in tech will find some bits of this less interesting, but Newport approaches the topic from a different angle than most agile books and covers a broader range of ideas.

Recommended if you like reading this sort of thing.

Rating: 7 out of 10

01 December, 2021 05:07AM

Paul Wise

FLOSS Activities November 2021

Focus

This month I didn't have any particular focus. I just worked on issues in my info bubble.

Changes

Issues

Review

Administration

  • Debian BTS: unarchive/reopen/triage bugs for reintroduced packages
  • Debian wiki: unblock IP addresses, approve accounts

Communication

  • Respond to queries from Debian users and contributors on the mailing lists and IRC

Sponsors

The SPTAG, visdom, gensim, purple-discord, plac, fail2ban, uvloop work was sponsored by my employer. All other work was done on a volunteer basis.

01 December, 2021 02:52AM

November 30, 2021



Steinar H. Gunderson

Commitcoin

How do you get a git commit with an interesting commit ID (or “SHA”)? Of course, interesting is in the eye of the beholder, but let's define it as having many repeated hex nibbles, e.g. “000” in the commit would be somewhat interesting and “8888888888888888888888888” would be very interesting. This is pretty similar to the dreaded cryptocoin mining; we have no simple way of forcing a given SHA-1 hash unless someone manages a complete second-preimage break, so we must brute-force. (And hopefully without boiling the planet in the process; we'd have to settle for a bit shorter runs than in the example above.)

Git commit IDs are SHA-1 checksums of what they contain; the tree object (“what does the commit contain”), the parents, the commit message and some dates. Of those, let's use the author date as the nonce (I chose to keep the committer date truthful, so as to not be accused of forging history too much). We can set up a shell script to commit with --amend, sweeping GIT_AUTHOR_DATE over the course of a day or so and having EDITOR=true in order not to have to close the editor all the time.

It turns out this is pretty slow (unsurprisingly!). So we discover that actually launching the “editor” takes a long time, and --no-edit is much faster. We can also move to a tmpfs in order not to be blocked on fsync and block allocation (eatmydata would also work, but doesn't fix the filesystem overhead). At this point, we're at roughly 50 commits/sec or so. So we can sweep through the entire day of author dates, and if nothing interesting comes up, we can just try again (as we also get a new committer date, we've essentially reset our random generator).

But we can do much better than this. Making a commit in git involves many different steps: load the index, see if we need to add something, then actually make the commit object, and finally update HEAD and whatever branch we might be on. Of those, we only really need to make the commit object and see what hash it ended up with! So we change our script to use git commit-tree instead, and whoa, we're up to 300 commits/sec.
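As a rough sketch of that stage (my reconstruction of the idea in Go, not Steinar's actual script; the tree and parent IDs, the message and the target pattern are placeholders), the sweep boils down to calling git commit-tree with a varying GIT_AUTHOR_DATE and checking each resulting ID:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	tree, parent := os.Args[1], os.Args[2] // e.g. `git write-tree` output and HEAD

	// Sweep a day's worth of author timestamps; the committer date (the
	// current time) changes between runs and acts as the rest of the nonce.
	start := int64(1638316800) // placeholder: 2021-12-01 00:00 UTC
	for ts := start; ts < start+86400; ts++ {
		cmd := exec.Command("git", "commit-tree", tree, "-p", parent, "-m", "interesting")
		cmd.Env = append(os.Environ(), fmt.Sprintf("GIT_AUTHOR_DATE=%d +0000", ts))
		out, err := cmd.Output()
		if err != nil {
			panic(err)
		}
		sha := strings.TrimSpace(string(out))
		if strings.Contains(sha, "dddddddd") { // "interesting" enough?
			fmt.Println(sha, ts)
			return
		}
	}
}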

Now we're bottlenecked at the time it takes to fork and launch the git binary—so we can hack the git sources and move the date sweep into builtin/commit-tree.c. This is radically faster; about 100 times as fast! Now what takes time is compressing and creating the commit object.

But OK, my 5950X has 16 cores, right, so we can just split the range in 16 and have different cores test different ranges? Wrong! Because now, the entire sweep takes less than a second, so we no longer get the different committer date and the cores are testing the same SHA over and over. (In effect, our nonce space is too small.) We cheat a bit and add extra whitespace to the end of the commit message to get a larger parameter space; the core ID determines how many spaces.

At this point, you can make commits so fast that the problem essentially becomes that you run out of space, and need to run git prune every few seconds. So the obvious next step would be to not compress and write out the commits at all… and then, I suppose, optimize the routines to not call any git stuff anymore, and then have GPUs do the testing, and of course, finally we'll have Gitcoin ASICs, and every hope of reaching the 1.5 degree goal is lost…

Did I say Gitcoin? No, unfortunately that name was already taken. So I'll call it Commitcoin. And I'm satisfied with a commit containing dddddddd, even though it's of course possible to do much better—the hardness is only approximately 2^26 commits to get a commit as interesting as that.

(Cryptobros, please stay out of my inbox. I'm not interested.)

30 November, 2021 11:00AM

Russell Coker

Your Device Has Been Improved

I’ve just started a Samsung tablet downloading a 770MB update, the description says:

  • Overall stability of your device has been improved
  • The security of your device has been improved

Technically I have no doubt that both those claims are true and accurate. But according to common understanding of the English language I think they are both misleading.

By “stability improved” they mean “fixed some bugs that made it unstable”, and no technical person would imagine that after a certain number of such updates the number of bugs will ever reach zero and the tablet will become perfectly reliable. In fact you should consider yourself lucky if they fix more bugs than they add. It’s not THAT uncommon for phones and tablets to be bricked (rendered unusable by software) by an update. In the past I got a Huawei Mate9 as a warranty replacement for a Nexus 6P because an update caused so many Nexus 6P phones to fail that they couldn’t be replaced with an identical phone [1].

By “security improved” they usually mean “fixed some security flaws that were recently discovered to make it almost as secure as it was designed to be”. Note that I deliberately say “almost as secure” because it’s sometimes impossible to fix a security flaw without making significant changes to interfaces which requires more work than desired for an old product and also gives a higher probability of things going wrong. So it’s sometimes better to aim for almost as secure or alternatively just as secure but with some features disabled.

Device manufacturers are not making devices more secure or more reliable than when they were initially released (and most companies in the Android space make the same claims while having the exact same bugs to deal with; Samsung is no different from the others in this regard). They are aiming to make them almost as secure and reliable as when they were released. They don’t have much incentive to try too hard in this regard; Samsung won’t suffer if I decide my old tablet isn’t reliable enough and buy a new one, which will almost certainly be from Samsung because they make nice tablets.

As a thought experiment, consider if car repairers did the same thing. “Getting us to service your car will improve fuel efficiency”, great how much more efficient will it be than when I purchased it?

As another thought experiment, consider if car companies stopped providing parts for car repair a few years after releasing a new model. This is effectively what phone and tablet manufacturers have been doing all along, software updates for “stability and security” are to devices what changing oil etc is for cars.

30 November, 2021 09:41AM by etbe

November 29, 2021


Evgeni Golov

Getting access to somebody else's Ansible Galaxy namespace

TL;DR: adding features after the fact is hard, normalizing names is hard, it's patched, carry on.

I promise, the longer version is more interesting and fun to read!

Recently, I was poking around Ansible Galaxy and almost accidentally got access to someone else's namespace. I was actually looking for something completely different, but accidental finds are the best ones!

If you're asking yourself: "what the heck is he talking about?!", let's slow down for a moment:

  • Ansible is a great automation engine built around the concept of modules that do things (mostly written in Python) and playbooks (mostly written in YAML) that tell which things to do
  • Ansible Galaxy is a place where people can share their playbooks and modules for others to reuse
  • Galaxy Namespaces are a way to allow users to distinguish who published what and reduce name clashes to a minimum

That means that if I ever want to share how to automate installing vim, I can publish evgeni.vim on Galaxy and other people can download that and use it. And if my evil twin wants their vim recipe published, it will end up being called evilme.vim. Thus while both recipes are called vim they can coexist, can be downloaded to the same machine, and used independently.

How do you get a namespace? It's automatically created for you when you login for the first time. After that you can manage it, you can upload content, allow others to upload content and other things. You can also request additional namespaces, this is useful if you want one for an Organization or similar entities, which don't have a login for Galaxy.

Apropos login, Galaxy uses GitHub for authentication, so you don't have to store yet another password, just smash that octocat!

Did anyone actually click on those links above? If you did (you didn't, right?), you might have noticed another section in that document: Namespace Limitations. That says:

Namespace names in Galaxy are limited to lowercase word characters (i.e., a-z, 0-9) and ‘_’, must have a minimum length of 2 characters, and cannot start with an ‘_’. No other characters are allowed, including ‘.’, ‘-‘, and space. The first time you log into Galaxy, the server will create a Namespace for you, if one does not already exist, by converting your username to lowercase, and replacing any ‘-‘ characters with ‘_’.

For my login evgeni this is pretty boring, as the generated namespace is also evgeni. But for the GitHub user Evil-Pwnwil-666 it will become evil_pwnwil_666. This can be a bit confusing.

Another confusing thing is that Galaxy supports two types of content: roles and collections, but namespaces are only for collections! So it is Evil-Pwnwil-666.vim if it's a role, but evil_pwnwil_666.vim if it's a collection.

I think part of this split is because collections were added much later and have a much more well thought design of both the artifact itself and its delivery mechanisms.

This is by the way very important for us! Due to the fact that collections (and namespaces!) were added later, there must be code that ensures that users who were created before also get a namespace.

Galaxy does this (and I would have done it the same way) by hooking into the login process, and after the user is logged in it checks if a Namespace exists and if not it creates one and sets proper permissions.

And this is also exactly where the issue was!

The old code looked like this:

    # Create lowercase namespace if case insensitive search does not find match
    qs = models.Namespace.objects.filter(
        name__iexact=sanitized_username).order_by('name')
    if qs.exists():
        namespace = qs[0]
    else:
        namespace = models.Namespace.objects.create(**ns_defaults)

    namespace.owners.add(user)

See how namespace.owners.add is always called? Even if the namespace already existed? Yepp!

But how can we exploit that? Any user either already has a namespace (and owns it) or doesn't have one that could be owned. And given users are tied to GitHub accounts, there is no way to confuse Galaxy here. Now, remember how I said one could request additional namespaces, for organizations and stuff? Those will have owners, but the namespace name might not correspond to an existing user!

So all we need is to find an existing Galaxy namespace that is not a "default" namespace (aka a specially requested one) and get a GitHub account that (after the funny name conversion) matches the namespace name.

Thankfully Galaxy has an API, so I could dump all existing namespaces and their owners. Next I filtered that list to have only namespaces where the owner list doesn't contain a username that would (after conversion) match the namespace name. I found a few. And for one of them (let's call it the_target), the corresponding GitHub username (the-target) was available! Jackpot!

I've registered a new GitHub account with that name, logged in to Galaxy and had access to the previously found namespace.

This felt like sufficient proof that my attack worked and I mailed my findings to the Ansible Security team. The issue was fixed in d4f84d3400f887a26a9032687a06dd263029bde3 by moving the namespace.owners.add call to the "new namespace" branch.

And this concludes the story of how I accidentally got access to someone else's Galaxy namespace (which was revoked after the report, no worries).

29 November, 2021 08:00AM by evgeni

Dima Kogan

GL_image_display

I just spent an unspeakable number of days typing to produce something that sounds very un-impressive: an FLTK widget that can display an image. The docs and code live here. The big difference from the usual image-drawing widget is that this one uses OpenGL internally, so after the initial image load, the common operations (drawing, redrawing, panning and zooming) are very fast. I have high-resolution images in my projects, and this will make my tools much nicer.

Three separate interfaces are available:

  • C: core library and GLUT application
  • C++: FLTK widget
  • Python: FLTK widget for pyfltk

The FLTK widgets have built-in interactive panning/zooming, and the library can draw line overlays. So nice applications can be built quickly.

I already added some early disabled-by-default support into the mrcal-stereo tool to visualize the rectification and report sensitivities:

widget.png

Nice!

29 November, 2021 07:53AM by Dima Kogan

Russ Allbery

Fall haul

It's been a while since I've posted one of these, and I also may have had a few moments of deciding to support authors by buying their books even if I'm not going to get a chance to read them soon. There's also a bit of work reading in here.

Ryka Aoki — Light from Uncommon Stars (sff)
Frederick R. Chromey — To Measure the Sky (non-fiction)
Neil Gaiman, et al. — Sandman: Overture (graphic novel)
Alix E. Harrow — A Spindle Splintered (sff)
Jordan Ifueko — Raybearer (sff)
Jordan Ifueko — Redemptor (sff)
T. Kingfisher — Paladin's Hope (sff)
TJ Klune — Under the Whispering Door (sff)
Kiese Laymon — How to Slowly Kill Yourself and Others in America (non-fiction)
Yuna Lee — Fox You (romance)
Tim Mak — Misfire (non-fiction)
Naomi Novik — The Last Graduate (sff)
Shelley Parker-Chan — She Who Became the Sun (sff)
Gareth L. Powell — Embers of War (sff)
Justin Richer & Antonio Sanso — OAuth 2 in Action (non-fiction)
Dean Spade — Mutual Aid (non-fiction)
Lana Swartz — New Money (non-fiction)
Adam Tooze — Shutdown (non-fiction)
Bill Watterson — The Essential Calvin and Hobbes (strip collection)
Bill Willingham, et al. — Fables: Storybook Love (graphic novel)
David Wong — Real-World Cryptography (non-fiction)
Neon Yang — The Black Tides of Heaven (sff)
Neon Yang — The Red Threads of Fortune (sff)
Neon Yang — The Descent of Monsters (sff)
Neon Yang — The Ascent to Godhood (sff)
Xiran Jay Zhao — Iron Widow (sff)

29 November, 2021 03:45AM

November 28, 2021


Wouter Verhelst

GR procedures and timelines

A vote has been proposed in Debian to change the formal procedure in Debian by which General Resolutions (our name for "votes") are proposed. The original proposal is based on a text by Russ Allbery, which changes a number of rules to be less ambiguous and, frankly, less weird.

One thing Russ' proposal does, however, which I am absolutely not in agreement with, is to add an absolutely hard time limit after three weeks. That is, in the proposed procedure, the discussion time will be two weeks initially (unless the Debian Project Leader chooses to reduce it, which they can do by up to one week), and it will be extended if more options are added to the ballot; but after three weeks, no matter where the discussion stands, the discussion period ends and Russ' proposed procedure forces us to go to a vote, unless all proposers of ballot options agree to withdraw their option.

I believe this is a big mistake. I think any procedure we come up with should allow for the possibility that we may end up in a situation where everyone agrees that extending the discussion time by a short amount is a good idea, without necessarily resetting the whole discussion time to another two weeks (modulo a decision by the DPL).

At the same time, any procedure we come up with should try to avoid the possibility of process abuse by people who would rather delay a vote ad infinitum than to see it voted upon. A hard time limit certainly does that; but I believe it causes more problems than it solves.

I think instead that it is necessary for any procedure to allow the discussion time to be extended for as long as a strong enough consensus exists that this would be beneficial.

As such, I have proposed an amendment to Russ' proposal (a full version of my proposed constitution can be seen on salsa) that hopefully solves these issues in a novel way: it allows anyone to request an extension to the discussion time, which then needs to be sponsored according to the same rules as a new ballot option. If the time extension is successfully created, those who supported the extension can then also no longer propose any new ones. Additionally, after 4 weeks, the proposed procedure allows anyone to object, so that 4 weeks is probably the practical limit -- although the possibility exists if enough support exists to extend the discussion time (or not enough to end it). The full rules involve slightly more than that (I don't like to put too much formal language in a blog post), but they're not too complicated, I think.

That proposal has received a number of seconds, but after a week it hasn't yet reached the constitutional requirement for the option to be on the ballot.

So, I guess this is a public request for more support to my proposal. If you're a Debian Developer and you agree with me that my proposed procedure is better than the alternative, please step forward and let yourself be heard.

Thanks!

28 November, 2021 07:04PM


Joachim Breitner

Zero-downtime upgrades of Internet Computer canisters

TL;DR: Zero-downtime upgrades are possible if you stick to the basic actor model.

Background

DFINITY’s Internet Computer provides a kind of serverless compute platform, where the services are WebAssembly programs called “canisters”. These services run without stopping (or at least that’s what it feels like from the service’s perspective; this is called “orthogonal persistence”), and process one message after another. Messages not only come from the outside (“ingress” calls), but are also exchanged between canisters.

On top of these uni-directional messages, the system provides the concept of “inter-canister calls”, which associates a response message with the outgoing message, and guarantees that a response will come. This RPC-like interface allows canister developers to program in the popular async/await model, where these inter-canister calls look almost like normal function calls, and the subsequent code is suspended until the response comes back.

The problem

This is all very well, until you try to upgrade your canister, i.e. install new code to fix a bug or add a feature. Because if you used the await pattern, there may still be suspended computations waiting for the response. If you swap out the program now, the code of that suspended computation will no longer be present, and the response cannot be handled! Worse, because of an infelicity with the current system’s API, when the response comes back, it may actually corrupt your service’s state.

That is why upgrading a canister requires stopping it first, which means waiting for all outstanding calls to come back. During this time, your canister is not available for new calls (so there is downtime), and worse, the length of the downtime is at the whims of the canisters you called – they could withhold the response ad infinitum, rendering your canister unupgradeable.

Clearly, this is not acceptable for any serious application. In this post, I’ll explore some of the ways to mitigate this problem, and how to create canisters that are safely and instantaneously (with no downtime) upgradeable.

It’s a spectrum

Some canisters are trivially upgradeable, for others all hope is lost; it depends on what the canister does and how. As an overview, here is the spectrum:

  1. A canister that never performs inter-canister calls can always be upgraded without stopping.
  2. A canister that only does one-way calls, and does them in a particular way (see below), can always be upgraded without stopping.
  3. A canister that performs calls, and where it is acceptable to simply drop outstanding responses, can always be upgraded without stopping, once the System API has been improved and your Canister Development Kit (CDK; Motoko or Rust) has adapted.
  4. A canister that performs calls, but uses explicit continuations to handle responses instead of the await convenience, based on an eventually fixed System API, can be upgraded without stopping, and will even handle responses afterwards.
  5. A canister that uses await to do inter-canister call cannot be upgraded without stopping.

In this post I will explain 2, which is possible now, in more detail. Variant 3 and 4 only become reality if and when the System API has improved.

One-way calls

A one-way call is a call where you don’t care about the response; neither the replied data, nor possible failure conditions.

Since you don’t care about the response, you can pass an invalid continuation to the system (technical detail: a Wasm table index of -1). Because it is invalid for any (realistic) Wasm module, it will stay invalid even after an upgrade, and the problem of silent corruption mentioned above is avoided. And otherwise it’s fine for this to be invalid: it means the canister “traps” once the response comes back, which is harmless (and possibly even cheaper than a do-nothing computation).

This requires your CDK to support this kind of call. Mostly incidentally, Motoko (and Candid) actually have the concept of a one-way call in their type system, namely shared functions with return type () instead of async ... (Motoko is actually older than the system, and not every prediction about what the system would provide has proven successful). So, pending this PR to be released, Motoko will implement one-way calls in this way. On Rust, you have to use the System API directly or wait for cdk-rs to provide this ability (patches welcome, happy to advise).

You might wonder: How are calls useful if I don’t get to look at the response? Of course, this is a set-back – calls with responses are useful, and await is convenient. And if you have to integrate with an existing service that only provides normal calls, you are out of luck.

But if you get to design the canister and all called canisters together, it may be possible to use only one-way messages. You’d be programming in the plain actor model now, with all its advantages (simple concurrency, easy to upgrade, general robustness).

Consider for example a token ledger canister, not unlike the ICP ledger canister. For the most part, it doesn’t have to do any outgoing calls (and is thus trivially upgradeable). But say we need to add notify functionality, where the ledger canister tells other canisters about a transaction. This is a good example for a one-way call: maybe the ledger canister doesn’t care if that notification was received? The ICP ledger does care (once it comes back successful, this particular notification cannot be sent again), but maybe your ledger can do it differently: let the other canister confirm the receipt via another one-way call, instead of via the reply; or simply charge for each notification and do not worry about repeated notifications.

Maybe you want to add archiving functionality, where the ledger canister streams its data to an archive canister. There, again, instead of using successful responses to confirm receipt, the archive canister can ping the ledger canister with the latest received index directly.

Yes, it changes the programming model a bit, and all involved parties have to play together, but the gain (zero-downtime upgrades) is quite valuable, and removes a fair number of other sources of issues.

And in the future?

The above is possible with today’s Internet Computer. If the System API improves the way I hope it will, there is a possible middle ground: you still don’t get to use await and instead have to write your response handlers as separate functions, but this way you can call any canister again, and you get the system’s assistance in mapping responses to calls. With this in place, any canister can be rewritten to a form that supports zero-downtime upgrades, without affecting its interface or what the canister can do.

28 November, 2021 05:11PM by Joachim Breitner (mail@joachim-breitner.de)

Russ Allbery

Review: Soul Music

Review: Soul Music, by Terry Pratchett

Series: Discworld #16
Publisher: Harper
Copyright: January 1995
Printing: November 2013
ISBN: 0-06-223741-1
Format: Mass market
Pages: 420

Soul Music is the sixteenth Discworld novel and something of a plot sequel to Reaper Man (although more of a sequel to the earlier Mort). I would not start reading the Discworld books here.

Susan is a student in the Quirm College for Young Ladies with an uncanny habit of turning invisible. Well, not invisible exactly; rather, people tend to forget that she's there, even when they're in the middle of talking to her. It's disconcerting for the teachers, but convenient when one is uninterested in Literature and would rather read a book.

She listened with half an ear to what the rest of the class was doing.

It was a poem about daffodils.

Apparently the poet had liked them very much.

Susan was quite stoic about this. It was a free country. People could like daffodils if they wanted to. They just should not, in Susan's very definite opinion, be allowed to take up more than a page to say so.

She got on with her education. In her opinion, school kept on trying to interfere with it.

Around her, the poet's vision was being taken apart with inexpert tools.

Susan's determinedly practical education is interrupted by the Death of Rats, with the help of a talking raven and Binky the horse, and without a lot of help from Susan, who is decidedly uninterested in being the sort of girl who goes on adventures. Adventures have a different opinion, since Susan's grandfather is Death. And Death has wandered off again.

Meanwhile, the bard Imp y Celyn, after an enormous row with his father, has gone to Ankh-Morpork. This is not going well; among other things, the Guild of Musicians and their monopoly and membership dues came as a surprise. But he does meet a dwarf and a troll in the waiting room of the Guild, and then buys an unusual music instrument in the sort of mysterious shop that everyone knows has been in that location forever, but which no one has seen before.

I'm not sure there is such a thing as a bad Discworld novel, but there is such a thing as an average Discworld novel. At least for me, Soul Music is one of those. There are some humorous bits, a few good jokes, one great character, and some nice bits of philosophy, but I found the plot forgettable and occasionally annoying. Susan is great. Imp is... not, which is made worse by the fact the reader is eventually expected to believe Susan cares enough about Imp to drive the plot.

Discworld has always been a mix of parody and Pratchett's own original creation, and I have always liked the original creation substantially more than the parody. Soul Music is a parody of rock music, complete with Cut-Me-Own-Throat Dibbler as an unethical music promoter. The troll Imp meets makes music by beating rocks together, so they decide to call their genre "music with rocks in it." The magical instrument Imp buys has twelve strings and a solid body. Imp y Celyn means "bud of the holly." You know, like Buddy Holly. Get it?

Pratchett's reference density is often on the edge of overwhelming the book, but for some reason the parody references in this one felt unusually forced and obvious to me. I did laugh occasionally, but by the end of the story the rock music plot had worn out its welcome. This is not helped by the ending being a mostly incoherent muddle of another parody (admittedly featuring an excellent motorcycle scene). Unlike Moving Pictures, which is a similar parody of Hollywood, Pratchett didn't seem to have much insightful to say about music. Maybe this will be more your thing if you like constant Blues Brothers references.

Susan, on the other hand, is wonderful, and for me is the reason to read this book. She is a delightfully atypical protagonist, and her interactions with the teachers and other students at the girls' school are thoroughly enjoyable. I would have happily read a whole book about her, and more broadly about Death and his family and new-found curiosity about the world. The Death of Rats was also fun, although more so in combination with the raven to translate. I wish this part of her story had a more coherent ending, but I'm looking forward to seeing her in future books.

Despite my complaints, the parody part of this book wasn't bad. It just wasn't as good as the rest of the book. I wanted a better platform for Susan's introduction than a lot of music and band references. If you really like Pratchett's parodies, your mileage may vary. For me, this book was fun but forgettable.

Followed, in publication order, by Interesting Times. The next Death book is Hogfather.

Rating: 7 out of 10

28 November, 2021 05:35AM

November 27, 2021

Review: A Psalm for the Wild-Built

Review: A Psalm for the Wild-Built, by Becky Chambers

Series: Monk & Robot #1
Publisher: Tordotcom
Copyright: July 2021
ISBN: 1-250-23622-3
Format: Kindle
Pages: 160

At the start of the story, Sibling Dex is a monk in a monastery in Panga's only City. They have spent their entire life there, love the buildings, know the hidden corners of the parks, and find the architecture beautiful. They're also heartily sick of it and desperate for the sound of crickets.

Sometimes, a person reaches a point in their life when it becomes absolutely essential to get the fuck out of the city.

Sibling Dex therefore decides to upend their life and travel the outlying villages doing tea service. And they do. They commission an ox-bike wagon, throw themselves into learning cultivation and herbs, experiment with different teas, and practice. It's a lot to learn, and they don't get it right from the start, but Sibling Dex is the sort of person who puts in the work to do something well. Before long, they have a new life as a traveling tea monk.

It's better than living in the City. But it still isn't enough.

We don't find out much about the moon of Panga in this story. Humans live there and it has a human-friendly biosphere with recognizable species, but it is clearly not Earth. The story does not reveal how humans came to live there. Dex's civilization is quite advanced and appears to be at least partly post-scarcity: people work and have professions, but money is rarely mentioned, poverty doesn't appear to be a problem, and Dex, despite being a monk with no obvious source of income, is able to commission the construction of a wagon home without any difficulty. They follow a religion that has no obvious Earth analogue.

The most fascinating thing about Panga is an event in its history. It previously had an economy based on robot factories, but the robots became sentient. Since this is a Becky Chambers story, the humans' reaction was to ask the robots what they wanted to do and respect their decision. The robots, not very happy about having their whole existence limited to human design, decided to leave, walking off into the wild. Humans respected their agreement, rebuilt their infrastructure without using robots or artificial intelligence, and left the robots alone. Nothing has been heard from them in centuries.

As you might expect, Sibling Dex meets a robot. Its name is Mosscap, and it was selected to check in with humans. Their attempts to understand each other make up much of the story. The rest is Dex's attempt to find what still seems to be missing from life, starting with an attempt to reach a ruined monastery out in the wild.

As with Chambers's other books, A Psalm for the Wild-Built contains a lot of earnest and well-meaning people having thoughtful conversations. Unlike her other books, there is almost no plot apart from those conversations of self-discovery and a profile of Sibling Dex as a character. That plus the earnestness of two naturally introspective characters who want to put their thoughts into words gave this story an oddly didactic tone for me. There are moments that felt like the moral of a Saturday morning cartoon show (I am probably dating myself), although the morals are more sophisticated and conditional. Saying I disliked the tone would be going too far, but it didn't flow as well for me as Chambers's other novels.

I liked the handling of religion, and I loved Sibling Dex's efforts to describe or act on an almost impossible to describe sense that their life isn't quite what they want. There are some lovely bits of description, including the abandoned monastery. The role of a tea monk in this imagined society is a neat, if small, bit of world-building: a bit like a counselor and a bit like a priest, but not truly like either because of the different focus on acceptance, listening, and a hot cup of tea. And Dex's interaction with Mosscap over offering and accepting food is a beautiful bit of characterization.

That said, the story as a whole didn't entirely gel for me, partly because of the didactic tone and partly because I didn't find Mosscap or the described culture of the robots as interesting as I was hoping that I would. But I'm still invested enough that I would read the sequel.

A Psalm for the Wild-Built feels like a prelude or character introduction more than a complete story. When we leave the characters, they're just getting started. You know more about the robots (and Sibling Dex) at the end than you did at the beginning, but don't expect much in the way of resolution.

Followed by A Prayer for the Crown-Shy, scheduled for 2022.

Rating: 7 out of 10

27 November, 2021 05:27AM

November 26, 2021

Reproducible Builds (diffoscope)

diffoscope 194 released

The diffoscope maintainers are pleased to announce the release of diffoscope version 194. This version includes the following changes:

[ Chris Lamb ]
* Don't traceback when comparing nested directories with non-directories.
  (Closes: reproducible-builds/diffoscope#288)

You can find out more by visiting the project homepage.

26 November, 2021 12:00AM

November 25, 2021

hackergotchi for Mike Gabriel

Mike Gabriel

Touching Firefox on Linux

More as a reminder to myself, but possibly also helpful to other people who want to use Firefox on a tablet running Debian...

Without the adjustment below, finger gestures in Firefox running on a tablet result in images moving, text highlighting, etc. (operations related to copy+paste), which is not the intuitively expected behaviour...

If you use e.g. GNOME on Wayland for your tablet and want to enable touch functionalities in Firefox, then switch the whole browser to native Wayland rendering. This line in ~/.profile seems to help:

export MOZ_ENABLE_WAYLAND=1

If you use a desktop environment running on top of X.Org, then make sure you have added the following line to ~/.profile:

export MOZ_USE_XINPUT2=1

Log out and log back in, and Firefox should be scrollable with two-finger movements up and down; zooming in and out also works then.

light+love
Mike (aka sunweaver at debian.org)

25 November, 2021 10:01AM by sunweaver

November 24, 2021

hackergotchi for Dirk Eddelbuettel

Dirk Eddelbuettel

nanotime 0.3.4 on CRAN: Maintenance Update

Another (minor) nanotime release, now at version 0.3.4, arrived at CRAN overnight. It exports some nanoperiod functionality via a C++ header, and Leonardo and I will use this in an upcoming package that we hope to talk about a little more in a few days. It also adds a few as.character.*() methods that had not been included before.

nanotime relies on the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic. Initially implemented using the S3 system, it has benefitted greatly from a rigorous refactoring by Leonardo who not only rejigged nanotime internals in S4 but also added new S4 types for periods, intervals and durations.

The NEWS snippet adds more details.

Changes in version 0.3.4 (2021-11-24)

  • Added a few more as.character conversion functions (Dirk)

  • Expose nanoperiod functionality via header file for use by other packages (Leonardo in #95 fixing #94).

Thanks to CRANberries there is also a diff to the previous version. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository.

If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

24 November, 2021 10:00PM

November 23, 2021

Enrico Zini

Really lossy compression of JPEG

Suppose you have a tool that archives images, or scientific data, and it has a test suite. It would be good to collect sample files for the test suite, but they are often so big one can't really bloat the repository with them.

But does the test suite need everything that is in those files? Not necessarily. For example, if one's testing code that reads EXIF metadata, one doesn't care about what is in the image.

That technique works extremely well. I can take GRIB files that are several megabytes in size, zero out their data payload, and get nice 1Kb samples for the test suite.

I've started to collect and organise the little hacks I use for this into a tool I called mktestsample:

$ mktestsample -v samples1/*
2021-11-23 20:16:32 INFO common samples1/cosmo_2d+0.grib: size went from 335168b to 120b
2021-11-23 20:16:32 INFO common samples1/grib2_ifs.arkimet: size went from 4993448b to 39393b
2021-11-23 20:16:32 INFO common samples1/polenta.jpg: size went from 3191475b to 94517b
2021-11-23 20:16:32 INFO common samples1/test-ifs.grib: size went from 1986469b to 4860b

Those are massive savings, but I'm not satisfied with those almost 94Kb of JPEG:

$ ls -la samples1/polenta.jpg
-rw-r--r-- 1 enrico enrico 94517 Nov 23 20:16 samples1/polenta.jpg
$ gzip samples1/polenta.jpg
$ ls -la samples1/polenta.jpg.gz
-rw-r--r-- 1 enrico enrico 745 Nov 23 20:16 samples1/polenta.jpg.gz

I believe I did all I could: completely blank out image data, set quality to zero, maximize subsampling, and tweak quantization to throw everything away.
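
For reference, the blanking step could look roughly like this in Python with Pillow (a sketch of the general idea only, not the actual mktestsample code; it keeps the EXIF block but replaces the pixel data with a flat black image saved at the lowest quality and heaviest chroma subsampling):

from PIL import Image

# Sketch: keep the EXIF block, but replace the pixel data with a flat
# black image saved at the lowest quality / heaviest chroma subsampling.
def blank_jpeg(src, dst):
    with Image.open(src) as img:
        size = img.size
        exif = img.info.get("exif")  # raw EXIF bytes, if present
    blank = Image.new("RGB", size)
    kwargs = {"quality": 1, "subsampling": 2, "optimize": True}
    if exif:
        kwargs["exif"] = exif
    blank.save(dst, "JPEG", **kwargs)

blank_jpeg("samples1/polenta.jpg", "samples1/polenta-blank.jpg")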

Still, the result is a 94Kb file that can be gzipped down to 745 bytes. Is there something I'm missing?

I suppose JPEG is better at storing an image than at storing the lack of an image. I cannot really complain :)

I can still commit compressed samples of large images to a git repository, taking very little data indeed. That's really nice!

23 November, 2021 06:58PM

November 22, 2021

Jonathan Wiltshire

Mischief managed

I’m finally paying up a certain amount of household technical debt, including investigating some exciting mystery cabling and insulating the space it inhabits. This has meant pulling down large chunks of ceiling (eventually, most or all of it for the insulation) on a cable hunt.

Turns out the best tool for this part of the job is a decent length of 4 by 2, some borrowed muscle, and a certain amount of bravery. Once a couple of holes have been cut the old-fashioned way to be sure there’s nothing crucial above the ceiling (like the other side of the felt roof), the 4×2 really comes into its own:

To use the 4×2, aim for a gap between two joists and imagine you’re holding a caber. Launch it. The ceiling will come off far worse than the lump of wood you’re holding.

We found the mystery cable, but didn’t really solve the mystery it creates, and in the process uncovered another bizarre installation. The local lighting circuit is mostly a spur system in that white junction box by the RSJ. The overhead supplies dive under the RSJ through the junction box to the light switch including a full 3-core feed, not the usual loop-in system used in the rest of the house (I am not sure how prevalent loop-in systems, sometimes called three-plate systems, are in other countries, but they’re very common in the UK).

It does at least explain why I could never reverse-engineer the setup from the ceiling roses alone, which had only half as many cores in the fitting as expected throughout the room (it wasn’t even that the first fitting was looped in and acting as a supply for the others).

On the other hand, normalising everything to a loop-in system and removing that awful rats nest of TPE should be straightforward. Neutral isn’t required in that switch so that’s one less problem.

I couldn’t resist labelling the switch in its relocated position:

Unfortunately, as valuable as that exercise was, I still have to get to the bottom of the original mystery cable which is at varying points 6mm2, 2.5mm2 and 1.5mm2 with apparently no current limiter or switch separation. Time for a bit more 4 by 2…

22 November, 2021 10:50PM by Jonathan

hackergotchi for Ricardo Mones

Ricardo Mones

Claws Mail 4 in experimental

A full month has passed since Claws Mail 4.0.0 was uploaded to Debian experimental, and, somewhat surprisingly, I've received no bug report about it.

This of course can be either because nobody has been brave enough to install it or because, well, it works really nicely.

For those who don't know what I'm talking about, just note that this version is the first Debian upload for the GTK+3 version of Claws Mail. There was an initial upstream release, namely 3.99, but it was less polished and also I was very busy, so I decided not to upload it. Since then I've been using git's 'gtk3' branch daily without problems, so, for me, it's as stable as its GTK+2 counterpart. There are still some rough edges, of course.

Note also that, if everything goes well, Claws Mail 4.x will be the version to be shipped with Debian 12 (bookworm).

22 November, 2021 09:49AM by mones

hackergotchi for Paul Tagliamonte

Paul Tagliamonte

Be careful when using vxlan!

I’ve spent a bit of time playing with vxlan - which is very neat, but also incredibly insecure by default.

When using vxlan, be very careful to understand how the host is connected to the internet. The kernel will listen on all interfaces for VXLAN packets, which means a host that is reachable by the VMs it’s hosting (e.g., over a bridged interface or a private LAN) will accept packets from those VMs and inject them into arbitrary VXLANs, even ones the VMs are not on.

I reported this to the kernel mailing list, with more technical details, but got no reply.

The tl;dr is:

  $ ip link add vevx0a type veth peer name vevx0z
  $ ip addr add 169.254.0.2/31 dev vevx0a
  $ ip addr add 169.254.0.3/31 dev vevx0z
  $ ip link add vxlan0 type vxlan id 42 \
    local 169.254.0.2 dev vevx0a dstport 4789
  $ # Note the above 'dev' and 'local' ip are set here
  $ ip addr add 10.10.10.1/24 dev vxlan0

results in vxlan0 listening on all interfaces, not just vevx0z or vevx0a. To prove it to myself, I spun up a docker container (using a completely different network bridge – with no connection to any of the interfaces above), and ran a Go program to send VXLAN UDP packets to my bridge host:

$ docker run -it --rm -v $(pwd):/mnt debian:unstable /mnt/spam 172.17.0.1:4789
$

which results in packets getting injected into my vxlan interface

$ sudo tcpdump -e -i vxlan0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vxlan0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:30:15.746754 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746773 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746787 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746801 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746815 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746827 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746870 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746885 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746899 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
21:30:15.746913 de:ad:be:ef:00:01 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 64: truncated-ip - 27706 bytes missing! 33.0.0.0 > localhost: ip-proto-114
10 packets captured
10 packets received by filter
0 packets dropped by kernel

(the program in question is the following:)

  package main

  import (
      "net"
      "os"
      "github.com/mdlayher/ethernet"
      "github.com/mdlayher/vxlan"
  )
  func main() {
      conn, err := net.Dial("udp", os.Args[1])
      if err != nil { panic(err) }
      for i := 0; i < 10; i++ {
          vxf := &vxlan.Frame{
              VNI: vxlan.VNI(42),
              Ethernet: &ethernet.Frame{
                  Source:      net.HardwareAddr{0xDE, 0xAD, 0xBE, 0xEF, 0x00, 0x01},
                  Destination: net.HardwareAddr{0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF},
                  EtherType:   ethernet.EtherTypeIPv4,
                  Payload:     []byte("Hello, World!"),
              },
          }
          frb, err := vxf.MarshalBinary()
          if err != nil { panic(err) }
          _, err = conn.Write(frb)
          if err != nil { panic(err) }
      }
  }

When using vxlan, be absolutely sure that all hosts that can address any interface on the host are authorized to send arbitrary packets into any VLAN that box can send to, or that there are very careful and specific controls and firewalling in place. Note this includes public interfaces (e.g., dual-homed private network / internet boxes), or any type of dual-homing (VPNs, etc).

22 November, 2021 02:39AM

November 21, 2021

Julian Andres Klode

APT Z3 Solver Basics

Z3 is a theorem prover developed at Microsoft Research and available as a dynamically linked C++ library in Debian-based distributions. While the library is a whopping 16 MB, and the solver is a tad slow, its permissive licensing and the number of tactics offered give it a huge potential for use in solving dependencies in a wide variety of applications.

Z3 does not need normalized formulas, but offers higher-level abstractions like atmost, atleast, and implies, which we will make use of together with boolean variables to translate the dependency problem to a form Z3 understands.

In this post, we’ll see how we can apply Z3 to the dependency resolution in APT. We’ll only discuss the basics here, a future post will explore optimization criteria and recommends.

Translating the universe

APT’s package universe consists of 3 relevant things: packages (the tuple of name and architecture), versions (basically a .deb), and dependencies between versions.

While we could translate our entire universe to Z3 problems, we instead will construct a root set from packages that were manually installed and versions marked for installation, and then build the transitive root set from it by translating all versions reachable from the root set.

For each package P in the transitive root set, we create a boolean literal P. We then translate each version P1, P2, and so on. Translating a version means building a boolean literal for it, e.g. P1, and then translating the dependencies as shown below.

We now need to create two more clauses to satisfy the basic requirements for debs:

  1. If a version is installed, the package is installed; and vice versa. We can encode this requirement for P above as P == atleast({P1,P2}, 1).
  2. There can only be one version installed. We add an additional constraint of the form atmost({P1,P2}, 1).

We also encode the requirements of the operation.

  1. For each package P that is manually installed, add a constraint P.
  2. For each version V that is marked for install, add a constraint V.
  3. For each package P that is marked for removal, add a constraint !P.

Dependencies

Packages in APT have dependencies of two basic forms: Depends and Conflicts, as well as variations like Breaks (identical to Conflicts in solving terms), and Recommends (soft Depends) - we’ll ignore those for now. We’ll discuss Conflicts in the next section.

Let’s take a basic dependency list: A Depends: X|Y, Z. To represent that dependency, we expand each name to a list of versions that can satisfy the dependency, for example X1|X2|Y1, Z1.

Translating this dependency list to our Z3 solver, we create boolean variables X1,X2,Y1,Z1 and define two rules:

  1. A implies atleast({X1,X2,Y1}, 1)
  2. A implies atleast({Z1}, 1)

If there actually was nothing that satisfied the Z requirement, we’d have added a rule not A. It would be possible to simply not tell Z3 about the version at all as an optimization, but that adds more complexity, and the not A constraint should not cause too many problems.
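
As a rough illustration only (using the z3-solver Python bindings; APT itself would go through the C++ library, and the variable names here are just the ones from the example above), the two rules could be stated like this:

from z3 import AtLeast, Bool, Implies, Solver

# A Depends: X|Y, Z, expanded to versions X1|X2|Y1, Z1
A, X1, X2, Y1, Z1 = (Bool(n) for n in ("A", "X1", "X2", "Y1", "Z1"))

s = Solver()
s.add(Implies(A, AtLeast(X1, X2, Y1, 1)))  # rule 1: the X|Y group
s.add(Implies(A, AtLeast(Z1, 1)))          # rule 2: the Z group
s.add(A)                                   # suppose A is requested
print(s.check())                           # sat
print(s.model())                           # some satisfying choice of versions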

Conflicts

Conflicts cannot have or in them. A dependency B Conflicts: X, Y means that only one of B, X, and Y can be installed. We can directly encode this in Z3 by using the constraint atmost({B,X,Y}, 1). This is an optimized encoding of the constraint: we could have encoded each conflict in the form !B or !X, !B or !Y, and so on. Usually this leads to worse performance as it introduces additional clauses.

Complete example

Let’s assume we start with an empty install and want to install the package a below.

Package: a
Version: 1
Depends: c | b

Package: b
Version: 1

Package: b
Version: 2
Conflicts: x

Package: d
Version: 1

Package: x
Version: 1

The translation in Z3 rules looks like this:

  1. Package rules for a:
    1. a == atleast({a1}, 1) - package is installed iff one version is
    2. atmost({a1}, 1) - only one version may be installed
    3. a – a must be installed
  2. Dependency rules for a
    1. implies(a1, atleast({b2, b1}, 1)) – the translated dependency above. note that c is gone, it’s not reachable.
  3. Package rules for b:
    1. b == atleast({b1,b2}, 1) - package is installed iff one version is
    2. atmost({b1, b2}, 1) - only one version may be installed
  4. Dependencies for b (= 2):
    1. atmost({b2, x1}, 1) - the conflicts between x and b = 2 above
  5. Package rules for x:
    1. x == atleast({x1}, 1) - package is installed iff one version is
    2. atmost({x1}, 1) - only one version may be installed

The package d is not translated, as it is not reachable from the root set {a1}; the transitive root set is {a1,b1,b2,x1}.
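
To make the rule set concrete, here is a sketch of how it could be handed to Z3 from the Python bindings (again, this is only an illustration of the encoding above, not APT code, which would use the C++ API):

from z3 import AtLeast, AtMost, Bool, Implies, Solver

a, a1 = Bool("a"), Bool("a1")
b, b1, b2 = Bool("b"), Bool("b1"), Bool("b2")
x, x1 = Bool("x"), Bool("x1")

s = Solver()
# 1. Package rules for a (and the request to install it)
s.add(a == AtLeast(a1, 1), AtMost(a1, 1), a)
# 2. Dependency rule for a: a1 implies b2 or b1 (c is unreachable)
s.add(Implies(a1, AtLeast(b2, b1, 1)))
# 3. Package rules for b
s.add(b == AtLeast(b1, b2, 1), AtMost(b1, b2, 1))
# 4. Conflict between b (= 2) and x
s.add(AtMost(b2, x1, 1))
# 5. Package rules for x
s.add(x == AtLeast(x1, 1), AtMost(x1, 1))

print(s.check())  # sat
print(s.model())  # a and a1 true, plus one of b1/b2 to satisfy the dependency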

Next iteration: Optimization

We have now constructed the basic set of rules that allows us to solve our dependency problems (equivalent to SAT). However, it might lead to suboptimal solutions where it removes automatically installed packages, or installs more packages than necessary, to name a few examples.

In our next iteration, we have to look at introducing optimization; for example, have the minimum number of removals, the minimal number of changed packages, or satisfy as many recommends as possible. We will also look at the upgrade problem (upgrade as many packages as possible) and the autoremove problem (remove as many automatically installed packages as possible).

21 November, 2021 07:49PM

Antoine Beaupré

The last syncmaildir crash

My syncmaildir (SMD) setup failed me one too many times (previously, previously). In an attempt to migrate to an alternative mail synchronization tool, I looked into using my IMAP server again, and found out my mail spool was in a pretty bad shape. I'm comparing mbsync and offlineimap in the next post but this post talks about how I recovered the mail spool so that tools like those could correctly synchronise the mail spool again.

The latest crash

On Monday, SMD just started failing with this error:

nov 15 16:12:19 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:12:22 angela systemd[2305]: smd-pull.service: Succeeded.
nov 15 16:12:22 angela systemd[2305]: Finished pull emails with syncmaildir.
nov 15 16:14:08 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:14:11 angela systemd[2305]: Failed to start pull emails with syncmaildir.
nov 15 16:16:14 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Unable to get any data from the other endpoint.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: This problem may be transient, please retry.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Hint: did you correctly setup the SERVERNAME variable
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: on your client? Did you add an entry for it in your ssh
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: configuration file?
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error
nov 15 16:16:17 angela smd-pull[27188]: register: smd-client@localhost: TAGS: error::context(handshake) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:16:17 angela systemd[2305]: Failed to start pull emails with syncmaildir.

What is frustrating is that there's actually no network error here. Running the command by hand I did see a different message, but now I have lost it in my backlog. It had something to do with a filename being too long, and I gave up debugging after a while. This happened suddenly too, which added to the confusion.

In a fit of rage I started this blog post and experimenting with alternatives, which led me down a lot of rabbit holes.

Reviewing my previous mail crash documentation, it seems most solutions involve talking to an IMAP server, so I figured I would just do that. Wanting to try something new, I gave isync (AKA mbsync) a try. Oh dear, I did not expect how much trouble just talking to my IMAP server would be, which wasn't isync's fault, for what that's worth. It was the primary tool I used to debug things, and served me well in that regard.

Mailbox corruption

The first thing I found out is that certain messages in the IMAP spool were corrupted. mbsync would stop on a FETCH command and Dovecot would give me those errors on the server side.

"wrong W value"

nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Maildir filename has wrong W value, renamed the file from /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S to /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495:2,S
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Deleting corrupted cache record uid=1582: UID 1582: Broken virtual size in mailbox junk: read(/home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S): FETCH BODY[] got too little data: 2540 vs 2578

At least this first error was automatically healed by Dovecot (by renaming the file without the W= flag). The problem is that the FETCH command fails and mbsync exits noisily. So you need to constantly restart mbsync with a silly command like:

while ! mbsync -a; do sleep 1; done

"cached message size larger than expected"

nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=mail stream)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: Deleting corrupted cache record uid=19288: UID 19288: Broken physical size in mailbox Sent: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=)
nov 16 13:53:08 marcos dovecot[3520770]: imap-login: Panic: epoll_ctl(del, 7) failed: Bad file descriptor

This second problem is much harder to fix, because dovecot does not recover automatically. This is Dovecot complaining that the cached size (the S= field, but also present in Dovecot's metadata files) doesn't match the file size.

I wonder if at least some of those messages were corrupted in the OfflineIMAP to syncmaildir migration because part of that procedure is to run the strip_header script to remove content from the emails. That could easily have broken things since the files do not also get renamed.

Workaround

So I read a lot of the Dovecot documentation on the maildir format, and wrote an extensive fix script for those two errors. The script worked and mbsync was able to sync the entire mail spool.
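
The gist of such a fix, sketched here in Python (this is not the author's actual script, just the general idea: drop stale S=/W= size fields from the Maildir filename so that Dovecot recomputes them):

import os
import re
import sys

SIZE_FIELDS = re.compile(r",(?:S|W)=\d+")

def fix_folder(folder):
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        m = re.search(r",S=(\d+)", name)
        if not m or not os.path.isfile(path):
            continue
        # If the cached size doesn't match the file, strip the size fields.
        if int(m.group(1)) != os.path.getsize(path):
            newname = SIZE_FIELDS.sub("", name)
            print(f"renaming {name} -> {newname}")
            os.rename(path, os.path.join(folder, newname))

for folder in sys.argv[1:]:
    fix_folder(folder)

One would run something like this over each cur/ directory and then clear Dovecot's caches, as described below.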

And no, rebuilding the index files didn't work. Also tried doveadm force-resync -u anarcat which didn't do anything.

In the end I also had to do this, because the wrong cache values were also stored elsewhere.

service dovecot stop ; find -name 'dovecot*' -delete; service dovecot start

This would have totally broken any existing clients, but thankfully I'm starting from scratch (except maybe webmail, but I'm hoping it will self-heal as well, assuming it only has a cache and not a full replica of the mail spool).

Incoherence between Maildir and IMAP

Unfortunately, the first mbsync was incomplete as it was missing about 15,000 mails:

anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 
384836
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l 
369221

As it turns out, mbsync was not at fault here either: this was yet more mail spool corruption.

It's actually 26 folders (out of 205) with inconsistent sizes, which can be found with:

for folder in * .[^.]* ; do 
  printf "%s\t%d\n" $folder $(find "$folder" -type f -a \! -name '.*' | wc -l );
done

The special \! -name '.*' bit is to ignore the mbsync metadata, which creates .uidvalidity and .mbsyncstate in every folder. That only ignores about 200 files, but since they are spread across all folders, they were making it impossible to review where the problem was.

Here is what the diff looks like:

--- Maildir-list    2021-11-17 20:42:36.504246752 -0500
+++ Maildir-mbsync-list 2021-11-17 20:18:07.731806601 -0500
@@ -6,16 +6,15 @@
[...]
 .Archives  1
 .Archives.2010 3553
-.Archives.2011 3583
-.Archives.2012 12593
+.Archives.2011 3582
+.Archives.2012 620
 .Archives.2013 8576
 .Archives.2014 11057
-.Archives.2015 8173
+.Archives.2015 8165
 .Archives.2016 54
 .band  34
 .bitbuck   1
@@ -38,13 +37,12 @@
 .couchsurfers  2
-cur    11285
+cur    11280
 .current   130
 .cv    2
 .debbug    262
-.debian    37544
-drafts 1
-.Drafts    4
+.debian    37533
+.Drafts    2
 .drone 241
 .drupal    188
 .drupal-devel  303
[...]

Misfiled messages

It's a bit all over the place, but we can already notice some huge differences between mailboxes, for example in the Archives folders. As it turns out, at least 12,000 of those missing mails were actually misfiled: instead of being in the Maildir/.Archives.2012/cur/ folder, they were directly in Maildir/.Archives.2012/. This is something that doesn't matter for SMD (and, it turns out, it does matter for notmuch: it suddenly found 12,000 new mails), but it definitely matters to Dovecot and therefore mbsync...
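
The move itself is trivial; a hypothetical sketch of it in Python, assuming the misfiled messages sit directly at the top of each folder:

import os
import shutil
import sys

# Move regular files misfiled at the top of a Maildir folder
# (e.g. Maildir/.Archives.2012/) into its cur/ subdirectory.
def refile(folder):
    cur = os.path.join(folder, "cur")
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and not name.startswith("."):
            shutil.move(path, os.path.join(cur, name))

for folder in sys.argv[1:]:
    refile(folder)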

After moving those files around, we still have 4,000 messages missing:

anarcat@angela:~(main)$ find Maildir-mbsync/  -type f -a \! -name '.*' | wc -l 
381196
anarcat@angela:~(main)$ find Maildir/  -type f -a \! -name '.*' | wc -l 
385053

The problem is that those 4,000 missing mails are harder to track. Take, for example, .Archives.2011, which has a single message missing, out of 3,582. And the files are not identical: the checksums don't match after going through the IMAP transport, so we can't use a tool like hashdeep to compare the trees and find why any single file is missing.

"register" folder

One big chunk of the 4,000, however, is a special folder called register in my spool, which I am syncing separately (see Securing registration email for details on that setup). That actually covers 3,700 of those messages, so I actually have a more modest 300 messages to figure out, after (easily!) configuring mbsync to sync that folder separately:

 @@ -30,9 +33,29 @@ Slave :anarcat-local:
  # Exclude everything under the internal [Gmail] folder, except the interesting folders
  #Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
  # Or include everything
 -Patterns *
 +#Patterns *
 +Patterns * !register  !.register
  # Automatically create missing mailboxes, both locally and on the server
  #Create Both
  Create slave
  # Sync the movement of messages between folders and deletions, add after making sure the sync works
  #Expunge Both
 +
 +IMAPAccount anarcat-register
 +Host imap.anarc.at
 +User register
 +PassCmd "pass imap.anarc.at-register"
 +SSLType IMAPS
 +CertificateFile /etc/ssl/certs/ca-certificates.crt
 +
 +IMAPStore anarcat-register-remote
 +Account anarcat-register
 +
 +MaildirStore anarcat-register-local
 +SubFolders Maildir++
 +Inbox ~/Maildir-mbsync/.register/
 +
 +Channel anarcat-register
 +Master :anarcat-register-remote:
 +Slave :anarcat-register-local:
 +Create slave

"tmp" folders and empty messages

After syncing the "register" messages, I end up with the measly little 160 emails out of sync:

anarcat@angela:~(main)$ find Maildir-mbsync/  -type f -a \! -name '.*' | wc -l 
384900
anarcat@angela:~(main)$ find Maildir/  -type f -a \! -name '.*' | wc -l 
385059

Argh. After more digging, I have found 131 mails in the tmp/ directories of the client's mail spool. Mysterious! On the server side, it's even more files, and not the same ones. Possible that those were mails that were left there during a failed delivery of some sort, during a power failure or some sort of crash? Who knows. It could be another race condition in SMD if it runs while mail is being delivered in tmp/...

The first thing to do with those is to cleanup a bunch of empty files (21 on angela):

find .[^.]*/tmp -type f -empty -delete

As it turns out, they are all duplicates, in the sense that notmuch can easily find a copy of files with the same message ID in its database. In other words, this hairy command returns nothing

find .[^.]*/tmp -type f | while read path; do
  msgid=$(grep -m 1  -i ^message-id "$path" | sed 's/Message-ID: //i;s/[<>]//g');
  if notmuch count --exclude=false  "id:$msgid" | grep -q 0; then
    echo "$path <$msgid> not in notmuch" ;
  fi;
done

... which is good. Or, to put it another way, this is safe:

find .[^.]*/tmp -type f -delete

Poof! 314 mails cleaned on the server side. Interestingly, SMD doesn't pick up on those changes at all and still sees files in tmp/ directories on the client side, so we need to operate the same twisted logic there.

notmuch to the rescue again

After cleaning that on the client, we get:

anarcat@angela:~(main)$ find Maildir/  -type f -a \! -name '.*' | wc -l 
384928
anarcat@angela:~(main)$ find Maildir-mbsync/  -type f -a \! -name '.*' | wc -l 
384901

Ha! 27 mails difference. Those are the really sticky, unclear ones. I was hoping a full sync might clear that up, but after deleting the entire directory and starting from scratch, I end up with:

anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 
385034
anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l 
384993

That is: even more messages missing (now 37). Sigh.

Thankfully, this is something notmuch can help with: it can index all files by Message-ID (which I learned is case-insensitive, yay) and tell us which messages don't make it through.

Considering the corruption I found in the mail spool, I wouldn't be the least surprised those messages are just skipped by the IMAP server. Unfortunately, there's nothing on the Dovecot server logs that would explain the discrepancy.

Here again, notmuch comes to the rescue. We can list all message IDs to figure out that discrepancy:

notmuch search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-msgids
notmuch --config=.notmuch-config-mbsync search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-mbsync-msgids

And then we can see how many messages notmuch thinks are missing:

$ wc -l *msgids
372723 Maildir-mbsync-msgids
372752 Maildir-msgids

That's 29 messages. Oddly, it doesn't exactly match the find output:

anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l 
385204
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 
385241

That is 10 more messages. Ugh. But actually, I know what those are: more misfiled messages (in a .folder/draft/ directory, bizarrely), so the totals actually match.

In the notmuch output, there's a lot of stuff like this:

id:notmuch-sha1-fb880d673e24f5dae71b6b4d825d4a0d5d01cde4

Those are messages without a valid Message-ID. Notmuch (presumably) constructs one based on the file's checksum. Because the files differ between the IMAP server and the local mail spool (which is unfortunate, but possibly inevitable), those do not match. There are exactly the same number of those on both sides, so I'll go ahead and assume those are all accounted for.

What remains is:

anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids  | grep '^\-[^-]' | grep -v sha1 | wc -l 
2
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids  | grep '^\+[^+]' | grep -v sha1 | wc -l 
21
anarcat@angela:~(main)$ 

i.e. 21 missing from mbsync, and, surprisingly, 2 missing from the original mail spool.

Further inspection also showed they were all messages with some sort of "corruption": no body and only headers. I am not sure that is a legal email format in the first place. Since they were mostly spam or administrative emails ("You have been unsubscribed from mailing list..."), it seems fairly harmless to ignore those.

Conclusion

As we'll see in the next article, SMD has stellar performance. But that comes at a huge cost: it accesses the mail storage directly. This can create (and has created) significant problems on the mail server. It's unclear exactly why those things happen, but Dovecot expects a particular storage format in its files, and it seems unwise to bypass that.

In the future, I'll try to remember to avoid that, especially since mechanisms like SMD require special server access (SSH) which, in the long term, I am not sure I want to maintain or expect.

In other words, just talking with an IMAP server opens up a lot more possibilities of hosting than setting up a custom synchronisation protocol over SSH. It's also safer and more reliable, as we have seen. Thankfully, I've been able to recover from all the errors I could find, but it could have gone differently and it would have been possible for SMD to permanently corrupt significant part of my mail archives.

In the end, however, the last drop was just another weird bug which, ironically, SMD mysteriously recovered from on its own while I was writing this documentation and migrating away from it.

In any case, I recommend SMD users start looking for alternatives. The project has been archived upstream, and the Debian package has been orphaned. I have seen significant mail box corruption, including entire mail spool destruction, mostly due to incorrect locking code. I have filed a release-critical bug in Debian to make sure it doesn't ship with Debian bookworm.

Alternatives like mbsync provide fast and reliable transport, including over SSH. See the next article for further discussion of the alternatives.

21 November, 2021 04:04PM

mbsync vs OfflineIMAP

After recovering from my latest email crash (previously, previously), I had to figure out which tool I should be using. I had many options but I figured I would start with a popular one (mbsync).

But I also evaluated OfflineIMAP, which was resurrected from the Python 2 apocalypse and which I had used before, for a long time.

Read on for the details.

Benchmark setup

All programs were tested against a Dovecot 1:2.3.13+dfsg1-2 server, running Debian bullseye.

The client is a Purism 13v4 laptop with a Samsung SSD 970 EVO 1TB NVMe drive.

The server is a custom build with an AMD Ryzen 5 2600 CPU, and a RAID-1 array made of two NVMe drives (Intel SSDPEKNW010T8 and WDC WDS100T2B0C).

The mail spool I am testing against has almost 400k messages and takes 13GB of disk space:

$ notmuch count --exclude=false
372758
$ du -sh --exclude xapian Maildir
13G Maildir

The baseline we are comparing against is SMD (syncmaildir) which performs the sync in about 7-8 seconds locally (3.5 seconds for each push/pull command) and about 10-12 seconds remotely.

Anything close to that or better is good enough. I do not have recent numbers for a SMD full sync baseline, but the setup documentation mentions 20 minutes for a full sync. That was a few years ago, and the spool has obviously grown since then, so that is not a reliable baseline.

A baseline for a full sync might also be set with rsync, which copies files at nearly 40MB/s, or 317Mb/s!

anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian  shell.anarc.at:Maildir/ Maildir/
 12,647,814,731 100%   37.85MB/s    0:05:18 (xfr#394981, to-chk=0/395815)    
72.38user 106.10system 5:19.59elapsed 55%CPU (0avgtext+0avgdata 15988maxresident)k
8816inputs+26305112outputs (0major+50953minor)pagefaults 0swaps

That is 5 minutes to transfer the entire spool. Incremental syncs are obviously pretty fast too:

anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian  shell.anarc.at:Maildir/ Maildir/
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/395815)    
1.42user 0.81system 0:03.31elapsed 67%CPU (0avgtext+0avgdata 14100maxresident)k
120inputs+0outputs (3major+12709minor)pagefaults 0swaps

As an extra curiosity, here's the performance with tar, pretty similar to rsync, minus incremental syncs, which I cannot be bothered to figure out right now:

anarcat@angela:tmp(main)$ time ssh shell.anarc.at tar --exclude xapian -cf - Maildir/ | pv -s 13G | tar xf - 
56.68user 58.86system 5:17.08elapsed 36%CPU (0avgtext+0avgdata 8764maxresident)k
0inputs+0outputs (0major+7266minor)pagefaults 0swaps
12,1GiO 0:05:17 [39,0MiB/s] [===================================================================> ] 92%

It's interesting that rsync manages to almost beat a plain tar on file transfer; I'm actually surprised by how well it performs here, considering there are many little files to transfer.

(But then again, this maybe is exactly where rsync shines: while tar needs to glue all those little files together, rsync can just directly talk to the other side and tell it to do live changes. Something to look at in another article maybe?)

Since both ends are NVMe drives, those should easily saturate a gigabit link. And in fact, a backup of the server mail spool achieves much faster transfer rate on disks:

anarcat@marcos:~$ tar fc - Maildir | pv -s 13G > Maildir.tar
15,0GiO 0:01:57 [ 131MiB/s] [===================================] 115%

That's 131 mebibytes per second, vastly faster than the gigabit link. The client has similar performance:

anarcat@angela:~(main)$ tar fc - Maildir | pv -s 17G > Maildir.tar
16,2GiO 0:02:22 [ 116MiB/s] [==================================] 95%

So those disks should be able to saturate a gigabit link, and they are not the bottleneck on fast links. That raises the question of what is blocking performance of a similar transfer over the gigabit link, but that's another question altogether, because no sync program ever reaches the above performance anyway.

Finally, note that when I migrated to SMD, I wrote a small performance comparison that could be interesting here. It shows SMD to be faster than OfflineIMAP, but not by as much as we see here. In fact, it looks like OfflineIMAP slowed down significantly since then (May 2018), but this could be due to my larger mail spool as well.

mbsync

The isync (AKA mbsync) project is written in C and supports syncing Maildir and IMAP folders, with possibly multiple replicas. I haven't tested this but I suspect it might be possible to sync between two IMAP servers as well. It supports partial mirrors, message flags, full folder support, and "trash" functionality.

Complex configuration file

I started with this .mbsyncrc configuration file:

SyncState *
Sync New ReNew Flags

IMAPAccount anarcat
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPStore anarcat-remote
Account anarcat

MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir-mbsync/

Channel anarcat
# AKA Far, convert when all clients are 1.4+
Master :anarcat-remote:
# AKA Near
Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
Patterns *
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both

Long gone are the days where I would spend a long time reading a manual page to figure out the meaning of every option. If that's your thing, you might like this one. But I'm more of a "EXAMPLES section" kind of person now, and I somehow couldn't find a sample file on the website. I started from the Arch wiki one but it's actually not great because it's made for Gmail (which is not a usual Dovecot server). So a sample config file in the manpage would be a great addition. Thankfully, the Debian packages ships one in /usr/share/doc/isync/examples/mbsyncrc.sample but I only found that after I wrote my configuration. It was still useful and I recommend people take a look if they want to understand the syntax.

Also, that syntax is a little overly complicated. For example, Far needs colons, like:

Far :anarcat-remote:

Why? That seems just too complicated. I also found that sections are not clearly identified: IMAPAccount and Channel mark section beginnings, for example, which is not at all obvious until you learn about mbsync's internals. There are also weird ordering issues: the SyncState option needs to be before IMAPAccount, presumably because it's global.

Using a more standard format like .INI or TOML could improve that situation.

Stellar performance

A transfer of the entire mail spool takes 56 minutes and 6 seconds, which is impressive.

It's not quite "line rate": the resulting mail spool was 12GB (which is a problem, see below), which turns out to be about 29Mbit/s and therefore not maxing the gigabit link, and an order of magnitude slower than rsync.

The incremental runs are roughly 2 seconds, which is even more impressive, as that's actually faster than rsync:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.015       0.052       1.930       2.029       2.105       
user        0.660       0.040       0.592       0.661       0.722       
sys         0.338       0.033       0.268       0.341       0.387    

Those tests were performed with isync 1.3.0-2.2 on Debian bullseye. Tests with a newer isync release originally failed because of a corrupted message that triggered bug 999804 (see below). Running 1.4.3 under valgrind works around the bug, but adds a 50% performance cost, the full sync running in 1h35m.

Once the upstream patch is applied, performance with 1.4.3 is fairly similar, considering that the new sync included the register folder with 4000 messages:

120.74user 213.19system 59:47.69elapsed 9%CPU (0avgtext+0avgdata 105420maxresident)k
29128inputs+28284376outputs (0major+45711minor)pagefaults 0swaps

That is ~13GB in ~60 minutes, which gives us 28.3Mbps. Incrementals are also pretty similar to 1.3.x, again considering the double-connect cost:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.500       0.087       2.340       2.491       2.629       
user        0.718       0.037       0.679       0.711       0.793       
sys         0.322       0.024       0.284       0.320       0.365

Those tests were all done on a Gigabit link, but what happens on a slower link? My server uplink is slow: 25 Mbps down, 6 Mbps up. There mbsync is worse than the SMD baseline:

===> multitime results
1: mbsync -a
Mean        Std.Dev.    Min         Median      Max
real        31.531      0.724       30.764      31.271      33.100      
user        1.858       0.125       1.721       1.818       2.131       
sys         0.610       0.063       0.506       0.600       0.695       

That's 30 seconds for a sync, which is an order of magnitude slower than SMD.

Great user interface

Compared to OfflineIMAP and (ahem) SMD, the mbsync UI is kind of neat:

anarcat@angela:~(main)$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 1/2  B: 204/205  F: +0/0 *0/0 #0/0  N: +1/200 *0/0 #0/0

(Note that nice switch away from slavery-related terms too.)

The display is minimal, and yet informative. It's not obvious what it all means at first glance, but the manpage is useful at least for clarifying that:

This represents the cumulative progress over channels, boxes, and messages affected on the far and near side, respectively. The message counts represent added messages, messages with updated flags, and trashed messages, respectively. No attempt is made to calculate the totals in advance, so they grow over time as more information is gathered. (Emphasis mine).

In other words:

  • C 2/2: channels done/total (2 done out of 2)
  • B 204/205: mailboxes done/total (204 out of 205)
  • F: changes on the far side
  • N: +10/200 *0/0 #0/0: changes on the "near" side:
    • +10/200: 10 out of 200 messages downloaded
    • *0/0: no flag changed
    • #0/0: no message deleted

You get used to it, in a good way. It does not, unfortunately, show up when you run it in systemd, which is a bit annoying as I like to see a summary mail traffic in the logs.

Interoperability issue

In my notmuch setup, I have bound key S to "mark spam", which basically assigns the tag spam to the message and removes a bunch of others. Then I have a notmuch-purge script which moves that message to the spam folder, for training purposes. It basically does this:

notmuch search --output=files --format=text0 "$search_spam" \
    | xargs -r -0 mv -t "$HOME/Maildir/${PREFIX}junk/cur/"

This method, which worked fine in SMD (and also OfflineIMAP), created this error on sync:

Maildir error: duplicate UID 37578.

And indeed, there are now two messages with that UID in the mailbox:

anarcat@angela:~(main)$ find Maildir/.junk/ -name '*U=37578*'
Maildir/.junk/cur/1637427889.134334_2.angela,U=37578:2,S
Maildir/.junk/cur/1637348602.2492889_221804.angela,U=37578:2,S

This is actually a known limitation or, as mbsync(1) calls it, a "RECOMMENDATION":

When using the more efficient default UID mapping scheme, it is important that the MUA renames files when moving them between Maildir folders. Mutt always does that, while mu4e needs to be configured to do it:

(setq mu4e-change-filenames-when-moving t)

So it seems I would need to fix my script. It's unclear how the paths should be renamed, which is unfortunate, because I would need to change my script to adapt to mbsync, but I can't tell how just from reading the above.

(A manual fix is actually to rename the file to remove the U= field: mbsync will generate a new one and then sync correctly.)
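
That manual fix might look something like this (a hypothetical Python helper, not something taken from mbsync or the notmuch-purge script):

import os
import re
import sys

# Strip the ",U=<uid>" token from a Maildir filename so that mbsync
# assigns a fresh UID to the moved message on the next sync.
UID_FIELD = re.compile(r",U=\d+")

for path in sys.argv[1:]:
    folder, name = os.path.split(path)
    newname = UID_FIELD.sub("", name)
    if newname != name:
        os.rename(path, os.path.join(folder, newname))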

Fortunately, someone else already fixed that issue: afew, a notmuch tagging script (much puns, such hurt), has a move mode that can rename files correctly, specifically designed to deal with mbsync. I had already been told about afew, but it's one more reason to standardize my notmuch hooks on that project, it looks like.

Update: I have tried to use afew and found it has significant performance issues. It also has a completely different paradigm to what I am used to: it assumes all incoming mail has the new tag and lays its own tags on top of that (inbox, sent, etc). It can only move files from one folder at a time (see this bug) which breaks my spam training workflow. In general, I sync my tags into folders (e.g. ham, spam, sent) and message flags (e.g. inbox is F, unread is "not S", etc), and afew is not well suited for this (although there are hacks that try to fix this). I have worked hard to make my tagging scripts idempotent, and it's something afew doesn't currently have. Still, it would be better to have that code in Python than bash, so maybe I should consider my options here.

Stability issues

The newer release in Debian bookworm (currently at 1.4.3) has stability issues on full sync. I filed bug 999804 in Debian about this, which lead to a thread on the upstream mailing list. I have found at least three distinct crashes that could be double-free bugs "which might be exploitable in the worst case", not a reassuring prospect.

The thing is: mbsync is really fast, but the downside of that is that it's written in C, and with that comes a whole set of security issues. The Debian security tracker has only three CVEs on isync, but the above issues show there could be many more.

Reading the source code certainly did not make me very comfortable with trusting it with untrusted data. I considered sandboxing it with systemd (below) but having systemd run as a --user process makes that difficult. I also considered using an apparmor profile but that is not trivial because we need to allow SSH and only some parts of it...

Thankfully, upstream has been diligent at addressing the issues I have found. They provided a patch within a few days which did fix the sync issues.

Automation with systemd

The Arch wiki has instructions on how to setup mbsync as a systemd service. It suggests using the --verbose (-V) flag which is a little intense here, as it outputs 1444 lines of messages.

I have used the following .service file:

[Unit]
Description=Mailbox synchronization service
ConditionHost=!marcos
Wants=network-online.target
After=network-online.target
Before=notmuch-new.service

[Service]
Type=oneshot
ExecStart=/usr/bin/mbsync -a
Nice=10
IOSchedulingClass=idle
NoNewPrivileges=true

[Install]
WantedBy=default.target

And the following .timer:

[Unit]
Description=Mailbox synchronization timer
ConditionHost=!marcos

[Timer]
OnBootSec=2m
OnUnitActiveSec=5m
Unit=mbsync.service

[Install]
WantedBy=timers.target

Note that we trigger notmuch through systemd, with the Before= directive above and by adding mbsync.service to the notmuch-new.service file:

[Unit]
Description=notmuch new
After=mbsync.service

[Service]
Type=oneshot
Nice=10
ExecStart=/usr/bin/notmuch new

[Install]
WantedBy=mbsync.service

An improvement over polling repeatedly with a .timer would be to wake up only on IMAP notify, but neither imapnotify nor goimapnotify seem to be packaged in Debian. It would also not cover the "sent folder" use case, where we need to wake up on local changes.
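
As an illustration of what such a notifier could look like, here is a rough Python sketch that issues an IMAP IDLE command with imaplib's low-level send() and readline() methods, and starts the mbsync.service unit whenever the server reports activity. The host and mailbox are placeholders, and a real implementation would need reconnection, timeouts and the periodic IDLE renewal that RFC 2177 recommends:

import imaplib
import subprocess

HOST = "imap.example.com"      # placeholder values for illustration
USER = "anarcat"
MAILBOX = "INBOX"

def wait_for_change(password):
    """Block until the server reports activity on MAILBOX, using IMAP IDLE."""
    imap = imaplib.IMAP4_SSL(HOST)
    imap.login(USER, password)
    imap.select(MAILBOX, readonly=True)
    # imaplib has no IDLE helper, so send the command by hand.
    imap.send(b"a1 IDLE\r\n")
    while True:
        line = imap.readline()
        if b"EXISTS" in line or b"RECENT" in line:
            break
    imap.send(b"DONE\r\n")
    # Drain the tagged completion of the IDLE command before logging out.
    while not imap.readline().startswith(b"a1 "):
        pass
    imap.logout()

password = subprocess.run(["pass", "imap.anarc.at"], capture_output=True,
                          text=True, check=True).stdout.strip()
while True:
    wait_for_change(password)
    # Start the existing unit so the Before=notmuch-new.service ordering applies.
    subprocess.run(["systemctl", "--user", "start", "mbsync.service"], check=False)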

Password-less setup

The sample file suggests this should work:

IMAPStore remote
Tunnel "ssh -q host.remote.com /usr/sbin/imapd"

Add BatchMode, restrict to IdentitiesOnly, provide a password-less key just for this, add compression (-C), find the Dovecot imap binary, and you get this:

IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"

And it actually seems to work:

$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 0/2  B: 0/1  F: +0/0 *0/0 #0/0  N: +0/0 *0/0 #0/0imap(anarcat): Error: net_connect_unix(/run/dovecot/stats-writer) failed: Permission denied
C: 2/2  B: 205/205  F: +0/0 *0/0 #0/0  N: +1/1 *3/3 #0/0imap(anarcat)<1611280><90uUOuyElmEQlhgAFjQyWQ>: Info: Logged out in=10808 out=15396642 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=1 body_bytes=8087

It's a bit noisy, however. dovecot/imap doesn't have a "usage" to speak of, but even the source code doesn't hint at a way to disable that Error message, so that's unfortunate. That socket is owned by root:dovecot so presumably Dovecot runs the imap process as $user:dovecot, which we can't do here. Oh well?

Interestingly, the SSH setup is not faster than IMAP.

With IMAP:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.367       0.065       2.220       2.376       2.458       
user        0.793       0.047       0.731       0.776       0.871       
sys         0.426       0.040       0.364       0.434       0.476

With SSH:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.515       0.088       2.274       2.532       2.594       
user        0.753       0.043       0.645       0.766       0.804       
sys         0.328       0.045       0.212       0.340       0.393

Basically: 200ms slower. Tolerable.

Migrating from SMD

The above was how I migrated to mbsync on my first workstation. The work on the second one was more streamlined, especially since the corruption on mailboxes was fixed:

  1. install isync, with the patch:

    dpkg -i isync_1.4.3-1.1~_amd64.deb
    
  2. copy all files over from previous workstation to avoid a full resync (optional):

    rsync -a --info=progress2 angela:Maildir/ Maildir-mbsync/
    
  3. rename all files to match new hostname (optional):

    find Maildir-mbsync/ -type f -name '*.angela,*' -print0 |  rename -0 's/\.angela,/\.curie,/'
    
  4. trash the notmuch database (optional):

    rm -rf Maildir-mbsync/.notmuch/xapian/
    
  5. disable all smd and notmuch services:

    systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
    
  6. do one last sync with smd:

    smd-pull --show-tags ; smd-push --show-tags ; notmuch new ; notmuch-sync-flagged -v
    
  7. backup notmuch on the client and server:

    notmuch dump | pv > notmuch.dump
    
  8. backup the maildir on the client and server:

    cp -al Maildir Maildir-bak
    
  9. create the SSH key:

    ssh-keygen -t ed25519 -f .ssh/id_ed25519_mbsync
    cat .ssh/id_ed25519_mbsync.pub
    
  10. add to .ssh/authorized_keys on the server, like this:

    command="/usr/lib/dovecot/imap",restrict ssh-ed25519 AAAAC...

  11. move old files aside, if present:

    mv Maildir Maildir-smd
    
  12. move new files in place (CRITICAL SECTION BEGINS!):

    mv Maildir-mbsync Maildir
    
  13. run a test sync, only pulling changes:

    mbsync --create-near --remove-none --expunge-none --noop anarcat-register

  14. if that works well, try with all mailboxes:

    mbsync --create-near --remove-none --expunge-none --noop -a

  15. if that works well, try again with a full sync:

    mbsync register
    mbsync -a

  16. reindex and restore the notmuch database, this should take ~25 minutes:

    notmuch new
    pv notmuch.dump | notmuch restore
    
  17. enable the systemd services and retire the smd-* services:

    systemctl --user enable mbsync.timer notmuch-new.service
    systemctl --user start mbsync.timer
    rm ~/.config/systemd/user/smd*
    systemctl daemon-reload

During the migration, notmuch helpfully told me the full list of those lost messages:

[...]
Warning: cannot apply tags to missing message: CAN6gO7_QgCaiDFvpG3AXHi6fW12qaN286+2a7ERQ2CQtzjSEPw@mail.gmail.com
Warning: cannot apply tags to missing message: CAPTU9Wmp0yAmaxO+qo8CegzRQZhCP853TWQ_Ne-YF94MDUZ+Dw@mail.gmail.com
Warning: cannot apply tags to missing message: F5086003-2917-4659-B7D2-66C62FCD4128@gmail.com
[...]
Warning: cannot apply tags to missing message: mailman.2.1316793601.53477.sage-members@mailman.sage.org
Warning: cannot apply tags to missing message: mailman.7.1317646801.26891.outages-discussion@outages.org
Warning: cannot apply tags to missing message: notmuch-sha1-000458df6e48d4857187a000d643ac971deeef47
Warning: cannot apply tags to missing message: notmuch-sha1-0079d8e0c3340e6f88c66f4c49fca758ea71d06d
Warning: cannot apply tags to missing message: notmuch-sha1-0194baa4cfb6d39bc9e4d8c049adaccaa777467d
Warning: cannot apply tags to missing message: notmuch-sha1-02aede494fc3f9e9f060cfd7c044d6d724ad287c
Warning: cannot apply tags to missing message: notmuch-sha1-06606c625d3b3445420e737afd9a245ae66e5562
Warning: cannot apply tags to missing message: notmuch-sha1-0747b020f7551415b9bf5059c58e0a637ba53b13
[...]

As detailed in the crash report, all of those were actually innocuous and could be ignored.

Also note that we completely trash the notmuch database because it's actually faster to reindex from scratch than to let notmuch slowly figure out that all mails are new and all the old mails are gone. The fresh indexing took:

nov 19 15:08:54 angela notmuch[2521117]: Processed 384679 total files in 23m 41s (270 files/sec.).
nov 19 15:08:54 angela notmuch[2521117]: Added 372610 new messages to the database.

A reindex on top of the existing database, by comparison, ran about twice as slow, at about 120 files/sec.

Current config file

Putting it all together, I ended up with the following configuration file:

SyncState *
Sync All

# IMAP side, AKA "Far"
IMAPAccount anarcat-imap
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"

IMAPStore anarcat-remote
Account anarcat-tunnel

# Maildir side, AKA "Near"
MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir/

# what binds Maildir and IMAP
Channel anarcat
Far :anarcat-remote:
Near :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
#Patterns *
Patterns * !register  !.register
# Automatically create missing mailboxes, both locally and on the server
Create Both
#Create Near
# Sync the movement of messages between folders and deletions, add after making sure the sync works
Expunge Both
# Propagate mailbox deletion
Remove both

IMAPAccount anarcat-register-imap
Host imap.anarc.at
User register
PassCmd "pass imap.anarc.at-register"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPAccount anarcat-register-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C register@imap.anarc.at /usr/lib/dovecot/imap"

IMAPStore anarcat-register-remote
Account anarcat-register-tunnel

MaildirStore anarcat-register-local
SubFolders Maildir++
Inbox ~/Maildir/.register/

Channel anarcat-register
Far :anarcat-register-remote:
Near :anarcat-register-local:
Create Both
Expunge Both
Remove both

Note that it may be out of sync with my live (and private) configuration file, as I do not publish my "dotfiles" repository publicly for security reasons.

OfflineIMAP

I've used OfflineIMAP for a long time before switching to SMD. I don't exactly remember why or when I started using it, but I do remember it became painfully slow as I started using notmuch, and would sometimes crash mysteriously. It's been a while, so my memory is hazy on that.

It also kind of died in a fire when Python 2 stopped being maintained. The main author moved on to a different project, imapfw, which could serve as a framework to build IMAP clients, but never seemed to implement all of the OfflineIMAP features and certainly not configuration file compatibility. Thankfully, a new team of volunteers ported OfflineIMAP to Python 3 and we can now test that new version to see if it is an improvement over mbsync.

Crash on full sync

The first thing that happened on a full sync is this crash:

Copy message from RemoteAnarcat:junk:
 ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Thread 'Copy message from RemoteAnarcat:junk' terminated with exception:
Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/imaputil.py", line 406, in utf7m_decode
    for c in binary.decode():
AttributeError: 'memoryview' object has no attribute 'decode'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/threadutil.py", line 146, in run
    Thread.run(self)
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)
AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')


Last 1 debug messages logged for Copy message from RemoteAnarcat:junk prior to exception:
thread: Register new thread 'Copy message from RemoteAnarcat:junk' (account 'Anarcat')
ERROR: Exceptions occurred during the run!
ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')

Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)

Folder junk [acc: Anarcat]:
 Copy message UID 30626 (29008/49310) RemoteAnarcat:junk -> LocalAnarcat:junk
Command exited with non-zero status 100
5252.91user 535.86system 3:21:00elapsed 47%CPU (0avgtext+0avgdata 846304maxresident)k
96344inputs+26563792outputs (1189major+2155815minor)pagefaults 0swaps

That only transferred about 8GB of mail, which gives us a transfer rate of 5.3Mbit/s, more than 5 times slower than mbsync. This bug is possibly limited to the bullseye version of offlineimap3 (the lovely 0.0~git20210225.1e7ef9e+dfsg-4), while the current sid version (the equally gorgeous 0.0~git20211018.e64c254+dfsg-1) seems unaffected.

Tolerable performance

The new release still crashes, except it does so at the very end, which is an improvement, since the mails do get transferred:

 *** Finished account 'Anarcat' in 511:12
ERROR: Exceptions occurred during the run!
ERROR: Exception parsing message with ID (<20190619152034.BFB8810E07A@marcos.anarc.at>) from imaplib (response type: bytes).
 AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')

Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(

ERROR: Exception parsing message with ID (<40A270DB.9090609@alternatives.ca>) from imaplib (response type: bytes).
 AttributeError: decoding with 'x-mac-roman' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')

Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(

ERROR: IMAP server 'RemoteAnarcat' does not have a message with UID '32686'

Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 889, in _fetch_from_imap
    raise OfflineImapError(reason, severity)

Command exited with non-zero status 1
8273.52user 983.80system 8:31:12elapsed 30%CPU (0avgtext+0avgdata 841936maxresident)k
56376inputs+43247608outputs (811major+4972914minor)pagefaults 0swaps
"offlineimap  -o " took 8 hours 31 mins 15 secs

This is 8h31m for transferring 12G, which is around 3.1Mbit/s. That is nine times slower than mbsync, almost an order of magnitude!
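
As a quick sanity check on those throughput figures, here is the arithmetic, a rough estimate counting a gigabyte as 10^9 bytes and ignoring protocol overhead:

def mbit_per_s(gigabytes, seconds):
    """Convert a transferred volume and elapsed time into Mbit/s."""
    return gigabytes * 8 * 1000 / seconds

# First run: about 8 GB before crashing, in 3h21m of wall clock time.
print(round(mbit_per_s(8, 3 * 3600 + 21 * 60), 1))        # ~5.3 Mbit/s
# Second run: about 12 GB in 8h31m12s.
print(round(mbit_per_s(12, 8 * 3600 + 31 * 60 + 12), 1))  # ~3.1 Mbit/s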

Now that we have a full sync, we can test incremental synchronization. That is also much slower:

===> multitime results
1: sh -c "offlineimap -o || true"
            Mean        Std.Dev.    Min         Median      Max
real        24.639      0.513       23.946      24.526      25.708      
user        23.912      0.473       23.404      23.795      24.947      
sys         1.743       0.105       1.607       1.729       2.002

That is also an order of magnitude slower than mbsync, and significantly slower than what you'd expect from a sync process. ~30 seconds is long enough to make me impatient and distracted; 3 seconds, less so: I can wait and see the results almost immediately.

Integrity check

That said: this is still on a gigabit link. It's technically possible that OfflineIMAP performs better than mbsync over a slow link, but I haven't tested that theory.

The OfflineIMAP mail spool is missing quite a few messages as well:

anarcat@angela:~(main)$ find Maildir-offlineimap -type f -type f -a \! -name '.*' | wc -l 
381463
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 
385247

... although that's probably all either new messages or the register folder, so OfflineIMAP might actually be in a better position there. But digging in more, it seems like the actual per-folder diff is fairly similar to mbsync: a few messages missing here and there. Considering OfflineIMAP's instability and poor performance, I have not looked any deeper in those discrepancies.

Other projects to evaluate

Those are all the options I have considered, in alphabetical order:

  • doveadm-sync: requires dovecot on both ends, can tunnel over SSH, may have performance issues in incremental sync, written in C
  • fdm: fetchmail replacement, IMAP/POP3/stdin/Maildir/mbox/NNTP support, SOCKS support (for Tor), complex rules for delivering to specific mailboxes, adding headers, piping to commands, etc.; discarded because no (real) support for keeping mail on the server, and written in C
  • getmail: fetchmail replacement, IMAP/POP3 support, supports incremental runs, classification rules, Python
  • interimap: syncs two IMAP servers, apparently faster than doveadm and offlineimap, but requires running an IMAP server locally, Perl
  • isync/mbsync: TLS client certs and SSH tunnels, fast, incremental, IMAP/POP/Maildir support, multiple mailbox, trash and recursion support, and generally has good words from multiple Debian and notmuch people (Arch tutorial), written in C, review above
  • mail-sync: notify support, happens over any piped transport (e.g. ssh), diff/patch system, requires binary on both ends, mentions UUCP in the manpage, mentions rsmtp which is a nice name for rsendmail; not evaluated because it seems awfully complex to set up, Haskell
  • nncp: treat the local spool as another mail server, not really compatible with my "multiple clients" setup, Golang
  • offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful (IIRC), ways to tunnel over SSH, review above, Python

Most projects were not evaluated due to lack of time.

Conclusion

I'm now using mbsync to sync my mail. I'm a little disappointed by the synchronisation times over the slow link, but I guess that's par for the course if we use IMAP. We are bound by the network speed much more than with custom protocols. I'm also worried about the C implementation and the crashes I have witnessed, but I am encouraged by the fast upstream response.

Time will tell if I will stick with that setup. I'm certainly curious about the promises of interimap and mail-sync, but I have run out of time on this project.

21 November, 2021 04:04PM

November 20, 2021

hackergotchi for Jonathan Dowland

Jonathan Dowland

hledger footguns

I wrote in budgeting tools that I was taking a look at Plain Text Accounting and in particular, hledger. The jury's still out on the tools, but in the time I've been looking at them I've come across a couple of foot-guns I thought it was worth writing down.

hledger's ledger format is derived from that of its predecessor ledger, and so some of the problems might be inherited.

1. significant white space delimiters

The basic syntax for a transaction looks like this:

2020-03-15 client payment
    assets:checking         $ 2000
    income:consulting       $-2000

There are some significant white space delimiters in play. The most subtle is what separates the account names from the values: it is two or more spaces. With a single space, the value is treated as part of the account name. For some reason I hit this frequently when trying to encode opening balances: the account name used as the source of the initial balances is something not otherwise generally referred to again (something like equity:opening balances) and the transaction amount is inferred where possible, so I ended up with a bunch of accounts named equity:opening balances £100 and similar.

2. flexible decimal delimiter

The value of transactions can be interspersed with commas and periods to make it more readable: e.g. $2000 could be written as $2,000. Different locales have different conventions here: It seems some(/most/all?) of Europe use periods to separate out the units and a comma to delimit the fractional part, whereas the US and the UK do the opposite. There is no built-in association between the currency symbol you are using and the period/comma convention: it's quite possible to accidentally write a number which is interpreted differently to how you intended, and it doesn't matter if you are using $ or £ etc.

3. new syntax has unexpected results in old versions

Finally, my favourite. hledger has a notion of rules that can be used to match transactions when importing from CSV. The format looks like this:

if (match rule)
& (another rule)
account1 some:account:from
account2 some:account:to

By default, multiple rules in sequence like above are OR'd: any of them can match. The & prefix switches the behaviour to AND. But & is a relatively new addition: it's not supported in 1.18.1, the version in Debian stable, which upstream released in June 2020. In prior versions the & prefix is not a syntax error, or at least not one that's reported: it's silently ignored, meaning the line with the & does nothing and any of the other rules in the set will still match. This is easy to miss, and means imports could be incorrectly posted.

20 November, 2021 09:03PM

November 19, 2021

Mike Hommey

Announcing git-cinnabar 0.5.8

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows you to clone, pull and push from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What’s new since 0.5.7?

  • Updated git to 2.34.0 for the helper.
  • Python 3.5 and newer are now officially supported. Git-cinnabar will try to use the python3 program by default, but will fall back to python2.7 if that's where the Mercurial libraries are available. It is possible to pick a specific python with the GIT_CINNABAR_PYTHON environment variable.
  • Fixed compatibility with Mercurial 5.8 and newer.
  • The prebuilt binaries are now optimized on arm64 macOS and Windows.
  • git cinnabar download now properly returns an error code when failing to extract the prebuilt binaries.
  • Pushing to a non-empty Mercurial repository without having pulled at least once from it is now prevented.
  • Replaced the nagging about fsck with a smaller check always happening after pulling.
  • Fail earlier on git fetch hg::url <sha1> (it would properly fetch the Mercurial changeset and its ancestors, but git would fail at the end because the sha1 is not a git sha1; use git cinnabar fetch instead)
  • Minor fixes.

19 November, 2021 10:05PM by glandium

hackergotchi for Gunnar Wolf

Gunnar Wolf

For our millionth bug, bookworms eat raspberries alive

I guess you already heard, right? The Debian Bug Tracking System has hit a big milestone! We just passed our one millionth bug report! (and yes, that’s a cause for celebration; bug reporting is probably the best way for the system to grow and improve)

So, to celebrate, I want to announce I have nudged our unofficial Raspberry Pi images build scripts to now also build images for our upcoming Debian release, Debian 12 «Bookworm»

(image above: A bookworm learns about raspberries in various stages of testing. Image sources: Transformers Wiki, CC BY-SA and Sam Saunders at Flickr, CC BY-SA)

So… Get’em while they are fresh! https://raspi.debian.net/! And enjoy the following (non-book)worm-on-a-raspberry picture from Wikimedia Commons:

Oh, FWIW – The site still shows images for Buster. You will notice they are no longer being autobuilt (why spend CPU time on something that's no longer going to change significantly?). The Bookworm images are not yet tested; as soon as I can test them, I will drop the Buster ones.

19 November, 2021 03:37PM

hackergotchi for Evgeni Golov

Evgeni Golov

A String is not a String, and that's Groovy!

Halloween is over, but I still have some nightmares to share with you, so sit down, take some hot chocolate and enjoy :)

When working with Jenkins, there is almost no way to avoid writing Groovy. Well, unless you only do old style jobs with shell scripts, but y'all know what I think about shell scripts…

Anyways, Eric has been rewriting the jobs responsible for building Debian packages for Foreman to pipelines (and thus Groovy).

Our build process for pull requests is rather simple:

  1. Setup sources - get the orig tarball and adjust the changelog to have a unique version for pull requests
  2. Call pbuilder
  3. Upload the built package to a staging archive for testing

For merges, it's identical, minus the changelog adjustment.

And if there are multiple packages changed in one go, it runs each step in parallel for each package.

Now I've been doing mass changes to our plugin packages, to move them to a shared postinst helper instead of having the same code over and over in every package. This required changes to many packages and sometimes I'd end up building multiple at once. That should be fine, right?

Well, yeah, it did build fine, but the upload only happened for the last package. This felt super weird, especially as I was absolutely sure we did test this scenario (multiple packages in one PR) and it worked just fine…

So I went on a ride through the internals of the job, trying to understand why it didn't work.

This requires a tad more information about the way we handle packages for Foreman:

  • the archive is handled by freight
  • it has suites like buster, focal and plugins (that one is a tad special)
  • each suite has components that match Foreman releases, so 2.5, 3.0, 3.1, nightly etc
  • core packages (Foreman etc) are built for all supported distributions (right now: buster and focal)
  • plugin packages are built only once and can be used on every distribution

As generating the package index isn't exactly fast in freight, we try not to run it too often. The idea was that when we build two packages for the same target (suite/version combination), we upload both at once and run import only once for both. That means that when we build Foreman for buster and focal, this results in two parallel builds and then two parallel uploads (as they end up in different suites). But if we build Foreman and Foreman Installer, we have four parallel builds, but only two parallel uploads, as we can batch upload Foreman and Installer per suite. Well, or so was the theory.

The Groovy code, that was supposed to do this looked roughly like this:

def packages_to_build = find_changed_packages()
def repos = [:]

packages_to_build.each { pkg ->
    suite = 'buster'
    component = '3.0'
    target = "${suite}-${component}"

    if (!repos.containsKey(target)) {
        repos[target] = []
    }

    repos[target].add(pkg)
}

do_the_build(packages_to_build)
do_the_upload(repos)

That's pretty straightforward, no? We create an empty Map, loop over a list of packages and add them to an entry in the map, which we pre-create as empty if it doesn't exist.

Well, no: the resulting map always ended up with only one element in each target list. And this is also why our original tests always worked: we tested with a PR containing changes to Foreman and a plugin, and plugins go to this special target we have…

So I started playing with the code (https://groovyide.com/playground is really great for that!), trying to understand why the heck it erases previous data.

The first finding was that it just always ended up jumping into the "if map entry not found" branch, even though the map very clearly had the correct entry after the first package was added.

The second one was weird. I was trying to minimize the reproducer code (IMHO always a good idea) and switched target = "${suite}-${component}" to target = "lol". Two entries in the list, only one jump into the "map entry not found" branch. What?! 🧐

So this is clearly related to the fact that we're using String interpolation here. But hey, that's a totally normal thing to do, isn't it?!

Admittedly, at this point, I was lost. I knew what breaks, but not why.

Luckily, I knew exactly who to ask: Jens.

After a brief "well, that's interesting", Jens quickly found the source of our griefs: Double-quoted strings are plain java.lang.String if there's no interpolated expression, but are groovy.lang.GString instances if interpolation is present. And when we do repos[target] the GString target gets converted to a String, but when we use repos.containsKey() it remains a GString. This is because GStrings get converted to Strings if the method wants one, but containsKey takes any Object, while the repos[target] notation for some reason converts it. Maybe this is because using GString as Map keys should be avoided.

We can reproduce this with simpler code:

def map = [:]
def something = "something"
def key = "${something}"
map[key] = 1
println key.getClass()
map.keySet().each {println it.getClass() }
map.keySet().each {println it.equals(key)}
map.keySet().each {println it.equals(key as String)}

Which results in the following output:

class org.codehaus.groovy.runtime.GStringImpl
class java.lang.String
false
true

With that knowledge, the fix was to just use the same repos[target] notation for the existence check as well: Groovy helpfully returns null, which is falsy, when an entry is absent from a Map.

So yeah, a String is not always a String, and it'll bite you!

19 November, 2021 02:16PM by evgeni

hackergotchi for Neil Williams

Neil Williams

git worktrees

A few scenarios have been problematic with git and I've now discovered git worktrees which help with each.

  • If you've wanted to compare multiple files in different branches of the same tree - without needing to commit on either side.
  • If you want to work on two (or more) versions of the same file at the same time, again without needing to commit.
  • You have a file or a bunch of files that aren't ready to be committed, even locally.
  • You are working on a development branch and an urgent fix is required on an old git tag.
  • You have a large git repository which is a burden to clone (or has complex submodules).

You could go to the trouble of making a new directory and re-cloning the same tree. However, a local commit in one tree is then not accessible to the other tree.

You could commit everything every time, but with a dirty tree, that involves sorting out the .gitignore rules as well. That could well be pointless with an experimental change.

Git worktrees allow multiple filesystems from a single git tree. Commits on any branch are visible from other branches, even when the commit was on a different worktree. This makes things like cherry-picking easy, without needing to push pointless changes or branches.

Branches on a worktree can be rebased as normal, with the benefit that commit hashes from other local changes are available for reference and cherry-picks.

I'm sure git worktrees are not new. However, I've only started using them recently and others have asked about how the worktree operates.

Creating a new tree can be done with a new or existing branch. To make it easier, set the new directory at the same time, usually in ../

New branch (branched from the current branch):

git worktree add -b branch_name ../branch_name

Existing branch - note, slightly different syntax here, specify the commit-ish last (branch name, tag or hash):

git worktree add ../branch_name branch_name
git worktree list
/home/neil/Documents/testing/testrepo        0612677 [master]
/home/neil/Documents/testing/testtree        d38f5a3 [testtree]

Use git worktree remove <name> to drop the entire directory for that tree and the git tracking.

I'm using this for work on the Debian Security Tracker. I have two local branches and having two worktrees allows me to have three terminals open, using the same files and the same git repository.

One to run make serve and update the local SQLite database. One to access master to run git pull. One to make local changes without risking collisions on master.

git add data/CVE/list
git commit
# pre commit hook runs here
git log -n 1
# copy the hash
# switch to master terminal
git pull
git cherry-pick <HASH>
git push
# switch to server terminal
git rebase master
# no git pull or fetch, it's all local
make
# switch back to changes terminal
git rebase master

Sadly, one area where this isn't as easy is importing a new DSC into Salsa with git-buildpackage, as that uses several branches at the same time. It would be possible, but you'll need to have separate upstream and possibly pristine-tar branches and supply the relevant options. Possibly something for git-buildpackage to adopt: it is common to need to make changes to the packaging with a new upstream release, and a lot of those changes are currently done outside git.

For the rest of the supported operations, see git-worktree(1).

19 November, 2021 01:26PM by Neil Williams

hackergotchi for Bits from Debian

Bits from Debian

New Debian Developers and Maintainers (September and October 2021)

The following contributors got their Debian Developer accounts in the last two months:

  • Bastian Germann (bage)
  • Gürkan Myczko (tar)

The following contributors were added as Debian Maintainers in the last two months:

  • Clay Stan
  • Daniel Milde
  • David da Silva Polverari
  • Sunday Cletus Nkwuda
  • Ma Aiguo
  • Sakirnth Nagarasa

Congratulations!

19 November, 2021 12:00PM by Jean-Pierre Giraud

hackergotchi for Mike Gabriel

Mike Gabriel

Improbability of a million, lintian thinks...

An interesting mindset overcome by reality...

Also, lintian does not differentiate between 100.000 and 1.000.000.

W: ayatana-indicator-display: improbable-bug-number-in-closes 1000143
N: 
N:   The most recent changelog closes a low-numbered bug number. While this is distantly possible, it's more likely a typo or
N:   a placeholder value that mistakenly wasn't filled in.
N: 
N:   Visibility: warning
N:   Show-Always: no
N:   Check: debian/changelog
N: 
N:

¯\_(ツ)_/¯

light+love
Mike

19 November, 2021 07:08AM by sunweaver

hackergotchi for Dirk Eddelbuettel

Dirk Eddelbuettel

RcppArmadillo 0.10.7.3.0 on CRAN: Bugfix, New Features

armadillo image

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 928 other packages on CRAN.

I somehow forgot to blog and tweet about the recent release based on the Armadillo 10.7.3 upstream release. Conrad is in “long-term support mode”, and 10.7.* is meant to provide fixes and stability relative to the most recent release which we did on September 30. We did actually find a regression when checking reverse dependencies, requiring an upstream move to 10.7.3. At the same time, we folded pull request #352 in. It addresses an old bug of ours where Armadillo field types were not converted correctly in all dimensions.

So now we have 0.10.7.3.0 on CRAN, as well as 0.10.7.3.1 with the (opt-in) field fixes in the drat repo. As the change perturbs a few existing packages, it is opt-in for now. We will likely aim at a proper deprecation of the old behaviour and give packages time to adjust. Stay tuned.

With that, big thanks to Jonathan Berrisch for filing issue #351 and basically addressing it in pull request #352. Very nice work, and I basically just wrapped a few more tests around it.

The full set of changes (since the last CRAN release 0.10.7.0.) follows.

Changes in RcppArmadillo version 0.10.7.3.1 (2021-11-18)

  • Correct dimensions setting in import/export of arma::field types, protected by #define (Jonathan Berrisch in #352 fixing #351)

  • Add unit tests for fields both with and without new #define (Dirk)

Changes in RcppArmadillo version 0.10.7.3.0 (2021-11-04)

  • Upgraded to Armadillo release 10.7.3

    • fix regression in alias handling by fliplr(), flipud(), reverse()

Changes in RcppArmadillo version 0.10.7.2.0 (2021-11-02)

  • Upgraded to Armadillo release 10.7.2

    • more robust handling of diagonal matrices by pinv()

Changes in RcppArmadillo version 0.10.7.1.0 (2021-10-08)

  • Upgraded to Armadillo release 10.7.1

    • fix regression in interactions between dense matrix subviews and sparse matrices

Courtesy of my CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

If you like this or other open-source work I do, you can sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

19 November, 2021 12:12AM

Reproducible Builds (diffoscope)

diffoscope 193 released

The diffoscope maintainers are pleased to announce the release of diffoscope version 193. This version includes the following changes:

[ Chris Lamb ]
* Don't duplicate file lists at each directory level.
  (Closes: #989192, reproducible-builds/diffoscope#263)
* When pretty-printing JSON, mark the difference as such, additionally
  avoiding including the full path.
  (Closes: reproducible-builds/diffoscope#205)

* Codebase improvements:
  - Update a bunch of %-style string interpolations into f-strings or
    str.format.
  - Import itertools top-level directly.
  - Drop some unused imports.
  - Use isinstance(...) over type(...) ==
  - Avoid aliasing variables if we aren't going to use them.

[ Brandon Maier ]
* Fix missing diff output on large diffs.

[ Mattia Rizzolo ]
* Ignore a Python warning coming from a dependent library (triggered by
  supporting Python 3.10)
* Document that support both Python 3.9 and 3.10.

You find out more by visiting the project homepage.

19 November, 2021 12:00AM

November 17, 2021

hackergotchi for Raphaël Hertzog

Raphaël Hertzog

Freexian’s report about Debian Long Term Support, October 2021

A Debian LTS logo

Every month we review the work funded by Freexian’s Debian LTS offering. Please find the report for October below.

Debian project funding

  • Our project funding work continues with an active bid on the work of packaging gradle in Debian. The next steps are reviewing the bid and formal approval.
  • In October 2,475 EUR was put aside to fund Debian projects.

We’re looking forward to receiving more projects from various Debian teams! Learn more about the rationale behind this initiative in this article.

Debian LTS contributors

In October 12 contributors were paid to work on Debian LTS, their reports are available below.

  • Adrian Bunk did 40.5h in October (out of 28.5h assigned and 18h remaining, thus keeping 6h for November).
  • Anton Gladky did 12h (out of 12h assigned).
  • Ben Hutchings did 14.75h in October (out of 2h assigned and 28h remaining, thus keeping 15.25h for November).
  • Chris Lamb did 18h (out of 18h assigned).
  • Holger Levsen did 1h (out of 12h assigned, but gave back the remaining 11h).
  • Jeremiah Foster worked 20h (out of 20h assigned and 10h remaining, thus keeping 10h for November).
  • Markus Koschany did 28.5h (out of 28.5h assigned).
  • Ola Lundqvist did 5h (out of 5h assigned).
  • Roberto C. Sánchez did 28.5h (out of 28.5h assigned).
  • Sylvain Beucler did 23.5h (out of 28.5h assigned, but gave back the remaining 5h).
  • Thorsten Alteholz did 28.5h (out of 28.5h assigned).
  • Utkarsh Gupta did 28.5h (out of 28.5h assigned).

Evolution of the situation

In October we released 34 DLAs.

Also, we would like to remark once again that we are constantly looking for new contributors. Please contact Jeremiah if you are interested!

The security tracker currently lists 37 packages with a known CVE and the dla-needed.txt file has 22 packages needing an update.

Thanks to our sponsors

Sponsors that joined recently are in bold.

17 November, 2021 04:51PM by Raphaël Hertzog

hackergotchi for Christoph Berg

Christoph Berg

PostgreSQL and Undelete

pg_dirtyread

Earlier this week, I updated pg_dirtyread to work with PostgreSQL 14. pg_dirtyread is a PostgreSQL extension that allows reading "dead" rows from tables, i.e. rows that have already been deleted, or updated. Of course that works only if the table has not been cleaned-up yet by a VACUUM command or autovacuum, which is PostgreSQL's garbage collection machinery.

Here's an example of pg_dirtyread in action:

# create table foo (id int, t text);
CREATE TABLE
# insert into foo values (1, 'Doc1');
INSERT 0 1
# insert into foo values (2, 'Doc2');
INSERT 0 1
# insert into foo values (3, 'Doc3');
INSERT 0 1

# select * from foo;
 id │  t
────┼──────
  1 │ Doc1
  2 │ Doc2
  3 │ Doc3
(3 rows)

# delete from foo where id < 3;
DELETE 2

# select * from foo;
 id │  t
────┼──────
  3 │ Doc3
(1 row)

Oops! The first two documents have disappeared.

Now let's use pg_dirtyread to look at the table:

# create extension pg_dirtyread;
CREATE EXTENSION

# select * from pg_dirtyread('foo') t(id int, t text);
 id │  t
────┼──────
  1 │ Doc1
  2 │ Doc2
  3 │ Doc3

All three documents are still there, but only one of them is visible.

pg_dirtyread can also show PostgreSQL's system columns with the row location and visibility information. For the first two documents, xmax is set, which means the row has been deleted:

# select * from pg_dirtyread('foo') t(ctid tid, xmin xid, xmax xid, id int, t text);
 ctid  │ xmin │ xmax │ id │  t
───────┼──────┼──────┼────┼──────
 (0,1) │ 1577 │ 1580 │  1 │ Doc1
 (0,2) │ 1578 │ 1580 │  2 │ Doc2
 (0,3) │ 1579 │    0 │  3 │ Doc3
(3 rows)

Undelete

Caveat: I'm not promising any of the ideas quoted below will actually work in practice. There are a few caveats and a good portion of intricate knowledge about the PostgreSQL internals might be required to succeed properly. Consider consulting your favorite PostgreSQL support channel for advice if you need to recover data on any production system. Don't try this at work.

I always had plans to extend pg_dirtyread to include some "undelete" command to make deleted rows reappear, but never got around to trying that. But rows can already be restored by using the output of pg_dirtyread itself:

# insert into foo select * from pg_dirtyread('foo') t(id int, t text) where id = 1;

This is not a true "undelete", though - it just inserts new rows from the data read from the table.

pg_surgery

Enter pg_surgery, which is a new PostgreSQL extension supplied with PostgreSQL 14. It contains two functions to "perform surgery on a damaged relation". As a side-effect, they can also make deleted tuples reappear.

As I discovered now, one of the functions, heap_force_freeze(), works nicely with pg_dirtyread. It takes a list of ctids (row locations) that it marks "frozen", but at the same time as "not deleted".

Let's apply it to our test table, using the ctids that pg_dirtyread can read:

# create extension pg_surgery;
CREATE EXTENSION

# select heap_force_freeze('foo', array_agg(ctid))
    from pg_dirtyread('foo') t(ctid tid, xmin xid, xmax xid, id int, t text) where id = 1;
 heap_force_freeze
───────────────────

(1 row)

Et voilà, our deleted document is back:

# select * from foo;
 id │  t
────┼──────
  1 │ Doc1
  3 │ Doc3
(2 rows)

# select * from pg_dirtyread('foo') t(ctid tid, xmin xid, xmax xid, id int, t text);
 ctid  │ xmin │ xmax │ id │  t
───────┼──────┼──────┼────┼──────
 (0,1) │    2 │    0 │  1 │ Doc1
 (0,2) │ 1578 │ 1580 │  2 │ Doc2
 (0,3) │ 1579 │    0 │  3 │ Doc3
(3 rows)

Disclaimer

Most importantly, none of the above methods will work if the data you just deleted has already been purged by VACUUM or autovacuum. These actively zero out reclaimed space. Restore from backup to get your data back.

Since both pg_dirtyread and pg_surgery operate outside the normal PostgreSQL MVCC machinery, it's easy to create corrupt data using them. This includes duplicated rows, duplicated primary key values, indexes being out of sync with tables, broken foreign key constraints, and others. You have been warned.

pg_dirtyread does not work (yet) if the deleted rows contain any toasted values. Possible other approaches include using pageinspect and pg_filedump to retrieve the ctids of deleted rows.

Please make sure you have working backups and don't need any of the above.

17 November, 2021 03:46PM

November 16, 2021

hackergotchi for Paul Tagliamonte

Paul Tagliamonte

Measuring the Power Output of my SDRs ⚡

Over the last few years, I’ve often wondered what the true power output of my SDRs are. It’s a question with a shocking amount of complexity in the response, due to a number of factors (mostly Frequency). The ranges given in spec sheets are often extremely vague, and if I’m being honest with myself, not incredibly helpful for being able to determine what specific filters and amplifiers I’ll need to get a clean signal transmitted.

Hey, heads up! - This post contains extremely unvalidated and back of the napkin quality work to understand how my equipment works. Hopefully this work can be of help to others, but please double check any information you need for your own work!

I was specifically interested in what gain output (in dBm) looks like across the frequency range – in particular, how variable the output dBm is when I change frequencies. The second question I had was understanding how linear the output gain is when adjusting the requested gain from the radio. Does a 2 dB increase on a HackRF API mean 2 dB of gain in dBm, no matter what the absolute value of the gain stage is?

I’ve finally bit the bullet and undertaken work to characterize the hardware I do have, with some outdated laboratory equipment I found on eBay. Of course, if it’s worth doing, it’s worth overdoing, so I spent a bit of time automating a handful of components in order to collect the data that I need from my SDRs.

I bought an HP 437B, which is the cutting edge of 30 years ago, but still accurate to within 0.01dBm. I paired this Power Meter with an Agilent 8481A Power Sensor (-30 dBm to 20 dBm from 10MHz to 18GHz). For some of my radios, I was worried about exceeding the 20 dBm mark, so I used a 20 dB attenuator while I waited for a higher-power power sensor. Finally, I was able to find a GPIB to USB interface, and get that interface working with the GPIB Kernel driver on my system.

With all that out of the way, I was able to write Go bindings to my HP 437B to allow for totally headless and automated control in sync with my SDR’s RF output. This allowed me to script the transmission of a sine wave at a controlled amplitude across a defined gain range and frequency range and read the Power Sensor’s measured dBm output to characterize the Gain across frequency and configured Gain.
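
The actual control code is in Go, but the sweep logic itself is simple. Here is a rough, language-agnostic sketch of it in Python, where transmit_tone() and read_dbm() are hypothetical stand-ins for the SDR and GPIB power-meter control code:

import statistics

def transmit_tone(frequency_hz, gain_db):
    """Hypothetical stand-in: transmit a sine wave at the given frequency and gain."""
    pass  # wire this to the SDR of your choice

def read_dbm():
    """Hypothetical stand-in: read the measured power from the power meter, in dBm."""
    return 0.0  # dummy value; replace with the GPIB query

def sweep(frequencies_hz, gains_db):
    """Measure output power (dBm) for every frequency/gain combination."""
    results = {}
    for freq in frequencies_hz:
        for gain in gains_db:
            transmit_tone(freq, gain)
            results[(freq, gain)] = read_dbm()
    return results

def stddev_per_gain(results):
    """Standard deviation of the measured dBm across frequency, per gain setting."""
    by_gain = {}
    for (freq, gain), dbm in results.items():
        by_gain.setdefault(gain, []).append(dbm)
    return {gain: statistics.stdev(vals) for gain, vals in by_gain.items()}

Averaging the per-gain standard deviations over the swept gain range gives figures comparable to the +/- numbers quoted below.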

HackRF

Looking at configured Gain against output power, the requested gain appears to have a fairly linear relation to the output signal power. The measured dBm ranged from the sensor noise floor to approx +13dBm. The average standard deviation of all tested gain values over the frequency range swept was +/-2dBm, with a minimum standard deviation of +/-0.8dBm, and a maximum of +/-3dBm.

When looking at output power over the frequency range swept, the HackRF contains a distinctive (and frankly jarring) ripple across the Frequency range, with a clearly visible jump in gain somewhere around 2.1GHz. I have no idea what is causing this massive jump in output gain, nor what is causing these distinctive ripples. I’d love to know more if anyone’s familiar with HackRF’s RF internals!

PlutoSDR

The power output is very linear when operating above -20dB TX channel gain, but can get quite erratic the lower the output power is configured. The PlutoSDR’s output power is directly related to the configured power level, and is generally predictable once a minimum power level is reached. The measured dBm ranged from the noise floor to 3.39 dBm, with an average standard deviation of +/-1.98 dBm, a minimum standard deviation of +/-0.91 dBm and a maximum standard deviation of +/-3.37 dBm.

Generally, the power output is quite stable, and looks to have very even and wideband gain control. There’s a few artifacts, which I have not confidently isolated to the SDR TX gain, noise (transmit artifacts such as intermodulation) or to my test setup. They appear fairly narrowband, so I’m not overly worried about them yet. If anyone has any ideas what this could be, I’d very much appreciate understanding why they exist!

Ettus B210

The power output on the Ettus B210 is higher (in dBm) than any of my other radios, but it has a very odd quirk where the power becomes nonlinear somewhere around -55dB TX channel gain. After that point, adding gain has no effect on the measured signal output in dBm up to 0 dB gain. The measured dBm ranged from the noise floor to 18.31 dBm, with an average standard deviation of +/-2.60 dBm, a minimum of +/-1.39 dBm and a maximum of +/-5.82 dBm.

When the Gain is somewhere around the noise floor, the measured gain is incredibly erratic, which throws off the maximum standard deviation significantly. I haven't isolated that to my test setup or the radio itself. I'm inclined to believe it's my test setup. The radio has a fairly even and wideband gain, and so long as you're operating between -70dB and -55dB, fairly linear as well.

Summary

Of all my radios, the Ettus B210 has the highest output (in dBm) over the widest frequency range, but the HackRF is a close second, especially after the gain bump kicks in around 2.1GHz. The Pluto SDR feels the most predictable and consistent, but also has a very low output, comparatively: right around 0 dBm.

Name      Max dBm   stdev dBm   stdev min dBm   stdev max dBm
HackRF    +12.6     +/-2.0      +/-0.8          +/-3.0
PlutoSDR  +3.3      +/-2.0      +/-0.9          +/-3.7
B210      +18.3     +/-2.6      +/-1.4          +/-6.0

16 November, 2021 03:06AM

November 15, 2021

Vincent Bernat

Git as a source of truth for network automation

The first step when automating a network is to build the source of truth. A source of truth is a repository of data that provides the intended state: the list of devices, the IP addresses, the network protocols settings, the time servers, etc. A popular choice is NetBox. Its documentation highlights its usage as a source of truth:

NetBox intends to represent the desired state of a network versus its operational state. As such, automated import of live network state is strongly discouraged. All data created in NetBox should first be vetted by a human to ensure its integrity. NetBox can then be used to populate monitoring and provisioning systems with a high degree of confidence.

When introducing Jerikan, a common piece of feedback we got was: “you should use NetBox for this.” Indeed, Jerikan’s source of truth is a bunch of YAML files versioned with Git.

Why Git?

If we look at how things are done with servers and services, in a datacenter or in the cloud, we are likely to find users of Terraform, a tool turning declarative configuration files into infrastructure. Declarative configuration management tools like Salt, Puppet,1 or Ansible take care of server configuration. NixOS is an alternative: it combines package management and configuration management with a functional language to build virtual machines and containers. When using a Kubernetes cluster, people use Kustomize or Helm, two other declarative configuration management tools. Taken together, these tools implement the infrastructure as code paradigm.

Infrastructure as code is an approach to infrastructure automation based on practices from software development. It emphasizes consistent, repeatable routines for provisioning and changing systems and their configuration. You make changes to code, then use automation to test and apply those changes to your systems.

― Kief Morris, Infrastructure as Code, O’Reilly.

A version control system is a central tool for infrastructure as code. The usual candidate is Git with a source code management system like GitLab or GitHub. You get:

Traceability and visibility
Git keeps a log of all changes: what, who, why, and when. With a bit of discipline, each change is explained and self-contained. It becomes part of the infrastructure documentation. When the support team complains about a degraded experience for some customers over the last two months or so, you quickly discover this may be related to a change to an incoming policy in New York.
Rolling back
If a change is defective, it can be reverted quickly, safely, and without much effort, even if other changes happened in the meantime. The policy change at the origin of the problem spanned over three routers. Reverting this specific change and deploying the configuration let you solve the situation until you find a better fix.
Branching, reviewing, merging
When working on a new feature or refactoring some part of the infrastructure, a team member creates a branch and works on their change without interfering with the work of other members. Once the branch is ready, a pull request is created and the change is ready to be reviewed by the other team members before merging. You discover the issue was related to diverting traffic through an IX where one ISP was connected without enough capacity. You propose and discuss a fix that includes a change of the schema and the templates used to declare policies to be able to handle this case.
Continuous integration
For each change, automated tests are triggered. They can detect problems and give more details on the effect of a change. Branches can be deployed to a test infrastructure where regression tests are executed. The results can be synthesized as a comment in the pull request to help the review. You check your proposed change does not modify the other existing policies.

Why not NetBox?

NetBox does not share these features. It is a database with a REST and a GraphQL API. Traceability is limited: changes are not grouped into a transaction and they are not documented. You cannot fork the database. Usually, there is one staging database to test modifications before applying them to the production database. It does not scale well and reviews are difficult. Applying the same change to the production database can be hazardous. Rolling back a change is non-trivial.

Update (2021-11)

Nautobot, a fork of NetBox, will soon address this point by using Dolt, an SQL database engine allowing you to clone, branch, and merge, like a Git repository. Dolt is compatible with MySQL clients. See “Nautobots, Roll Back!” for a preview of this feature.

Moreover, NetBox is not usually the single source of truth. It contains your hardware inventory, the IP addresses, and some topology information. However, this is not the place you put authorized SSH keys, syslog servers, or the BGP configuration. If you also use Ansible, this information ends up in its inventory. The source of truth is therefore fragmented between several tools with different workflows. Since NetBox 2.7, you can append additional data with configuration contexts. This mitigates the issue somewhat. The data is arranged hierarchically but the hierarchy cannot be customized.2 Nautobot can manage configuration contexts in a Git repository, while still allowing the use of the API to fetch them. You get some additional perks, thanks to Git, but the remaining data is still in a database with a different lifecycle.

Lastly, the schema used by NetBox may not fit your needs and you cannot tweak it. For example, you may have a rule to compute the IPv6 address from the IPv4 address for dual-stack interfaces. Such a relationship cannot be easily expressed and enforced in NetBox. When changing the IPv4 address, you may forget the IPv6 address. The source of truth should only contain the IPv4 address but you also want the IPv6 address in NetBox because this is your IPAM and you need it to update your DNS entries.

Why not Git?

There are some limitations when putting your source of truth in Git:

  1. If you want to expose a web interface to allow an external team to request a change, it is more difficult to do it with Git than with a database. Out-of-the-box, NetBox provides a nice web interface and a permission system. You can also write your own web interface and interact with NetBox through its API.
  2. YAML files are more difficult to query in different ways. For example, looking for a free IP address is complex if they are scattered in multiple places (see the sketch after this list).
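To make that second point concrete, here is a minimal Python sketch of such a query, assuming a hypothetical data/ directory where each YAML file lists its allocated addresses under an ips: key (a real repository would have its own schema and helpers):

import ipaddress
import pathlib
import yaml  # PyYAML

def allocated_ips(root):
    # Collect every address listed under an "ips" key in YAML files below root.
    found = set()
    for path in pathlib.Path(root).rglob("*.yaml"):
        data = yaml.safe_load(path.read_text()) or {}
        if not isinstance(data, dict):
            continue
        for value in data.get("ips", []):
            found.add(ipaddress.ip_address(value))
    return found

def first_free(prefix, used):
    # Return the first host address in the prefix that is not already allocated.
    for host in ipaddress.ip_network(prefix).hosts():
        if host not in used:
            return host
    return None

print(first_free("192.0.2.0/24", allocated_ips("data")))

It works, but every such question needs its own little script, whereas a database answers it with a single query.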

In my opinion, in most cases, you are better off putting the source of truth in Git instead of NetBox. You get a lot of perks by doing that and you can still use NetBox as a read-only view, usable by other tools. We do that with an Ansible module. In the remaining cases, Git could still fit the bill. Read-only access control can be done through submodules. Pull requests can restrict write access: a bot can check the changes only modify allowed files before auto-merging. This still requires some Git knowledge, but many teams are now comfortable using Git, thanks to its ubiquity.
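As an illustration of that last point, the check such a bot performs can be quite small. A sketch in Python, with assumed conventions (the allowed prefix and the use of git diff --name-only are mine, not a description of any existing bot):

import subprocess
import sys

# Hypothetical policy: this external team may only touch files under these prefixes.
ALLOWED_PREFIXES = ("customers/",)

def changed_files(base, head):
    # Files modified between two commits, as reported by git.
    out = subprocess.run(["git", "diff", "--name-only", f"{base}...{head}"],
                         capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]

offending = [f for f in changed_files(sys.argv[1], sys.argv[2])
             if not f.startswith(ALLOWED_PREFIXES)]
if offending:
    sys.exit("refusing to auto-merge, files outside allowed paths: " + ", ".join(offending))

Everything else - actually merging, commenting on the pull request - is plumbing around this check.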


  1. Wikimedia manages its infrastructure with Puppet. They publish everything on GitHub. Creative Commons uses Salt. They also publish everything on GitHub. Thanks to them for doing that! I wish I could provide more real-life examples. ↩︎

  2. Being able to customize the hierarchy is key to avoiding repetition in the data. For example, if switches are paired together, some data should be attached to them as a group and not duplicated on each of them. Tags can be used to partially work around this issue but you lose the hierarchical aspect. ↩︎

15 November, 2021 07:53AM by Vincent Bernat

November 14, 2021

Russ Allbery

Review: The Last Graduate

Review: The Last Graduate, by Naomi Novik

Series: The Scholomance #2
Publisher: Del Rey
Copyright: 2021
ISBN: 0-593-12887-7
Format: Kindle
Pages: 388

This is a direct sequel to A Deadly Education, by which I mean it starts in the same minute at which A Deadly Education ends (and let me say how grateful I am for a sequel that doesn't drop days, months, or years between books). You do not want to read this series out of order.

This book is also very difficult to review without spoiling either it or the previous book, so please bear with me if I'm elliptical in my ravings. Because The Last Graduate is so good. So good, not only as a piece of writing, but as a combination of two of my favorite tropes in fiction, one of which I can't talk about because of spoilers. I adored this book in a way that is not entirely rational.

I will attempt a review below anyway, but if you liked the first book, just stop reading here and go read the second one. It's more of everything I loved in the first book except even better, it did some things I was expecting and some things I didn't expect at all, and it's just so ridiculously good. Just be aware that it has another final-line cliffhanger. The third book is coming in (hopefully) 2022.

Novik handles the cliffhanger at the end of the previous book beautifully, which is worth noting because there were so many ways in which it could have gone poorly. One of the best things about this series is Novik's skill at writing El's relationship with her mother, even though her mother has not appeared in the series so far. El argues with her mother's voice in her head, tells stories about her, wonders what her mother would think of her classmates (or in some cases knows exactly what her mother would think of her classmates), and sometimes makes the explicit decision to not be her mother. The relationship has the sort of messy complexity, shared history, and underlying respect that many people experience in life but that I've rarely seen portrayed this well in a fantasy novel.

Novik's presentation of that relationship works because El's voice is so strong. Within fifteen minutes of starting The Last Graduate, I was already muttering "I love this book" to myself, mostly because of how much I enjoy El's sarcastic, self-deprecating internal commentary. Novik strikes a balance between self-awareness, snark, humor, and real character growth that rivals Murderbot in the effectiveness of its first-person perspective. It carries the story over a few weak points, such as a romance that didn't do much for me. Even when I didn't care about part of the plot, I cared about El's opinion of the plot and what it said about El's growing understanding of how to navigate the world.

A Deadly Education was scene and character establishment. El insisted on being herself and following her own morals and social rules, and through that found some allies. The Last Graduate gives El enough breathing space to make more nuanced decisions. This is the part of growing up where one realizes the limitations of one's knee-jerk reactions and innate moral judgment. It's also when it becomes hard to trust success that is entirely outside of one's previous experience. El was not a kid who had friends, so she doesn't know what to do with them now that she has them. She's barely able to convince herself that they are friends.

This is one of the two fictional tropes I mentioned, the one that I can talk about (at least briefly) without major spoilers. I have such a soft spot for stubborn, sarcastic, principled characters who refuse to play by the social rules that they think are required to make friends and who then find friends who like them for themselves. The moment when they start realizing this has happened and have no idea how to deal with it or how to be a person who has friends is one I will happily read over and over again. I enjoyed this book from the beginning, but there were two points when it grabbed my heart and I was all in. The first one is a huge spoiler that I can't talk about. The second was this paragraph:

[She] came round to me and put her arm around my waist and said under her breath, "Hey, she can be taught," with a tease in her voice that wobbled a little, and when I looked at her, her eyes were bright and wet, and I put my arm around her shoulders and hugged her.

You'll know it when you get there.

The Last Graduate also gives the characters other than El and Orion more room, which is part of how it handles the chosen one trope. It's been obvious since early in the first book that Orion is a sort of chosen one, and it becomes obvious to the reader that El may be as well. But Novik doesn't let the plot focus only on them; instead, she uses that trope to look at how alliances and collective action happen, and how no one can carry the weight by themselves. As El learns more and gains power, she also becomes less central to the plot resolution and has to learn how to be less self-reliant. This is not a book where one character is trained to save the world. It's a book where she manages to enlist the support of a kick-ass project manager and becomes part of a team.

Middle books of a trilogy are notoriously challenging. Often they're travel books: the first book sets up a problem, the second book moves the characters both physically and emotionally into a position to solve the problem, and the third book is the payoff. Travel books often sag. They can feel obligatory but somewhat boring, like a chore on the way to the third-book climax. The Last Graduate is not a travel book; it is, instead, a pivot book, which is my favorite form of trilogy. It's a book that rewrites the problem the first book set up, both resolving it and expanding the scope beyond what the reader had expected. This is immensely satisfying when done well, and Novik does it extremely well.

This is not a flawless book. There are some pacing hiccups, there is a romance angle that didn't work for me (although it does arrive at some character insights that I thought were spot on), and although I think Novik is doing something interesting with the trope, there is a lot of chosen one power escalation happening here. It's not the sort of book that I can claim is perfectly written. Instead, it's the sort of book that uses some of my favorite plot elements and emotional beats in such an effective way and with such a memorable character that I do not have it in me to care about any of the flaws. Your mileage may therefore vary, but I would be happy to read books like this until the end of time.

As mentioned above, The Last Graduate ends on another cliffhanger. This time I was worried that Novik might have ended the series there, since there's enough of an internal climax that I could imagine some literary fiction (which often seems allergic to endings) would have stopped here. Thankfully, Novik's web site says this is not the case. The next year is going to be a difficult wait.

The third book of this series is going to be incredibly difficult to write, and I hope Novik is up to the challenge she's made for herself. But she handled the transition between the first and second book so well, and this book is so good that I have a lot of hope. If the third book is half as good as I'm hoping, this is going to be one of my favorite fantasy series of all time.

Followed by an as-yet-untitled third book.

Rating: 10 out of 10

14 November, 2021 04:49AM

Ruby Team

Ruby transition and packaging hints #2 - Gemfile.lock created by bundler/setup with Ruby 2.7 preventing successful test with Ruby 3.0

We currently face an issue in all packages requiring bundler/setup and trying to run the tests for Ruby 2.7 and 3.0. The problem is that the first test run will create Gemfile.lock (or gemfile/gemfile-*.lock) using Ruby 2.7 and the next run for Ruby 3 will report e.g.:

Failure/Error: require 'bundler/setup' # Set up gems listed in the Gemfile.

Bundler::GemNotFound:
  Could not find racc-1.4.16 in any of the sources

or

/usr/share/rubygems-integration/all/gems/bundler-2.2.27/lib/bundler/definition.rb:496:in `materialize':
  Could not find rexml-3.2.3.1 in any of the sources (Bundler::GemNotFound)

Both bugs #996207 and #996302 are incarnations of this issue. The fix is as easy as making sure that the .lock files are removed before each run. This can be done in e.g. debian/ruby-tests.rake as the very first task:

File.delete("Gemfile.lock") if File.exist?("Gemfile.lock")

In another case the .lock file is created by the tests in gemfiles/. While the first examples could actually be solved by gem2deb removing Gemfile.lock on its own, I’m not quite sure how to handle the last case using packaging tools.

The interesting part is that we are unlikely to be confronted with this issue again anytime soon. It seems very specific to the Ruby 3.0 transition.

Update

After talking to Antonio, he added some code to gem2deb-test-runner to move Gemfile.lock files out of the way. The tool already did this in an autopkgtest environment. In the upcoming 1.7 release it will do it in general and this will fix some more FTBFSes, e.g. #998497 and #996141 - originally reported against ruby-voight-kampff and ruby-bootsnap.

14 November, 2021 03:25AM by Daniel Leidert (dleidert@debian.org)

November 13, 2021

Ruby transition and packaging hints #1 - Adjusting Ruby version in commands

This is the first part of a series of short posts about issues that came up during the Ruby 3.0 transition and how to fix them. Hopefully more team members will join in and add their input.

During the Ruby 3.0 transition there are essentially two different Ruby versions with two different binaries available, /usr/bin/ruby2.7 and /usr/bin/ruby3.0, while /usr/bin/ruby points to the current default version, which is Ruby 2.7.

In some cases the tests shipped by the source packages will use shell commands to run scripts or Ruby code. It is imperative that in these cases the Ruby executable is not invoked via /usr/bin/ruby or ruby, because this will point to Ruby 2.7 only and fail if the tests are invoked with Ruby version 3.

The fix is to rely on RbConfig.ruby which will point to the absolute pathname of the ruby command for the current Ruby environment, e.g.

cmd = "#{RbConfig.ruby} ..."

This issue appeared for example in ruby-byebug and ruby-backports.

13 November, 2021 08:24PM by Daniel Leidert (dleidert@debian.org)

John Goerzen

Managing an External Display on Linux Shouldn’t Be This Hard

I first started using Linux and FreeBSD on laptops in the late 1990s. Back then, there were all sorts of hassles and problems, from hangs on suspend to pure failure to boot. I still worry a bit about suspend on unknown hardware, but by and large, the picture of Linux on laptops has dramatically improved over the last years. So much so that now I can complain about what would once have been a minor nit: dealing with external monitors.

I have a USB-C dock that provides both power and a Thunderbolt display output over the single cable to the laptop. I think I am similar to most people in wanting the following behavior from the laptop:

  • When the lid is closed, suspend if no external monitor is connected. If an external monitor is connected, shut off the built-in display and use the external one exclusively, but do not suspend.
  • Lock the screen automatically after a period of inactivity.
  • While locked, all connected displays should be powered down.
  • When an external display is connected, begin using it automatically.
  • When an external display is disconnected, stop using it. If the lid is closed when the external display is disconnected, go into suspend mode.

This sounds so simple. But somehow on Linux we’ve split up these things into a dozen tiny bits:

  • In /etc/systemd/logind.conf, there are settings about what to do when the lid is opened or closed (a sketch of the relevant knobs follows this list).
  • Various desktop environments have overlapping settings covering the same things.
  • Then there are the display managers (gdm3, lightdm, etc) that also get in on the act, and frequently have DIFFERENT settings, set in different places, from the desktop environments. And, what’s more, they tend to be involved with locking these days.
  • Then there are screensavers (gnome-screensaver, xscreensaver, etc.) that also enter the picture, and also have settings in these areas.
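For the logind piece alone, the relevant knobs look something like this (a sketch of /etc/systemd/logind.conf values that would map onto the behavior above, not a recommendation):

[Login]
HandleLidSwitch=suspend
HandleLidSwitchExternalPower=suspend
HandleLidSwitchDocked=ignore

And that still says nothing about locking, powering down displays, or switching audio - those live in the other layers.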

Problems I’ve Seen

My problems don’t even begin with laptops, but with my desktop, running XFCE with xmonad and lightdm. My desktop is hooked to a display that has multiple inputs. This scenario (reproducible in both buster and bullseye) causes the display to be unusable until a reboot on the desktop:

  1. Be logged in and using the desktop
  2. Without locking the desktop screen, switch the display input to another device
  3. Keep the display input on another device long enough for the desktop screen to auto-lock
  4. At this point, it is impossible to re-awaken the desktop screen.

I should note here that the problems aren’t limited to Debian, but also extend to Ubuntu and various hardware.

Lightdm: which greeter?

At some point while troubleshooting things after upgrading my laptop to bullseye, I noticed that while both machines were running lightdm, I had different settings and a different appearance between the two. Upon further investigation, I realized that one had slick-greeter and lightdm-settings installed, while the other had lightdm-gtk-greeter and lightdm-gtk-greeter-settings installed. Very strange.

XFCE: giving up

I eventually gave up on making lightdm work. No combination of settings or greeters would make things work reliably when changing screen configurations. I installed xscreensaver. It doesn’t hang, but it does sometimes take a few tries before it figures out what device to display on.

Worse, since updating from buster to bullseye, XFCE no longer automatically switches audio output when the docking station is plugged in, and there seems to be no easy way to convince Pulseaudio to do this.

X-Based Gnome and derivatives… sigh.

I also tried Gnome, Mate, and Cinnamon, and all of them had various inabilities to configure things to act the way I laid out above.

I’ve long not been a fan of Gnome’s way of hiding things from the user. It now has a Windows-like situation of three distinct settings programs (settings, tweaks, and dconf editor), which overlap in strange ways and interact with systemd in even stranger ways. Gnome 3 makes it quite non-intuitive to make app icons from various programs work, and so forth.

Trying Wayland

I recently decided to set up an older laptop that I hadn’t used in a while. After reading up on Wayland, I decided to try Gnome 3 under Wayland. Both the Debian and Arch wikis note that KDE is buggy on Wayland, which leaves Gnome as the only desktop environment that supports it, unless I want to go with Sway. There’s some appeal to Sway for this xmonad user, but I’ve read of incompatibilities of Wayland software when Gnome’s not available, so I opted to try Gnome.

Well, it’s better. Not perfect, but better. After finding settings buried in a ton of different Settings and Tweaks boxes, I had it mostly working, except gdm3 would never shut off power to the external display. Eventually I found /etc/gdm3/greeter.dconf-defaults, and added:

sleep-inactive-ac-timeout=60
sleep-inactive-ac-type='blank'
sleep-inactive-battery-timeout=120
sleep-inactive-battery-type='suspend'

Of course, these overlap with but are distinct from the same kinds of things in Gnome settings.

Sway?

Running without Gnome seems like a challenge; Gnome is switching audio output appropriately, for instance. I am looking at some of the Gnome Shell tiling window manager extensions and hope that some of them may work for me.

13 November, 2021 04:21PM by John Goerzen

November 12, 2021

hackergotchi for Jonathan Dowland

Jonathan Dowland

Frictionless external backups with systemd

Here's a description of how my monthly external backups are managed at a technical level. I didn't realise I hadn't written this all down anywhere yet.

What

blinkenlights!

I plug in one of two (prepared) external hard drives into my headless NAS. The NAS contains my primary data backup. A job automatically decrypts the encrypted filesystem on the drive, mounts it and synchronises the copy of my backup data on the drive from that on the NAS. Whilst this is going on, the blinkstick LED on the NAS switches to a colour to signal "in progress". When it's done, the light changes to green to signal "done" and I can remove it. If something goes wrong, it turns red and I get mail.

Why

I want a third-strand, off-site backup of my and my family's data in case of a disaster in our house. For it to be useful it has to be regular, so I needed to remove as much of the friction of performing the backup as possible.

I use two drives alternately so that I don't have all my eggs in one basket in the window when I bring one of them home and perform the backup.

How

As much as possible I lean on systemd and its ability to trigger actions based on events.

  1. External drive is plugged in. systemd instantiates a corresponding device unit, named dev-disk-by\x2duuid-aaaaaaaa\x2daaaa\x2daaaa\x2daaaa\x2daaaaaaaaaaaa.device, where aaaa… is the UUID of a partition on the device

  2. The backup job is a systemd service which has a WantedBy relationship on the device unit, so when the device appears, systemd starts the backup service.

  3. The backup service has Requires and After relationships on systemd-cryptsetup@extbackup.service, a service created by systemd's cryptsetup generator on start-up (but slightly customised, see below). The encrypted device is therefore unlocked.

  4. The backup service defines multiple start and stop commands with ExecStart and ExecStop. These are used to:

    1. set the blinkstick to the working colour (blue-ish)
    2. mount the now-decrypted filesystem
    3. get a lock on the backup repository (so nothing else writes to it) and synchronise the files
    4. unmount the filesystem
    5. set the blinkstick to the success colour (green)
  5. Finally, the systemd-cryptsetup@extbackup.service unit realises it is not required any more. It has been customised with StopWhenUnneeded=true1, so the encrypted filesystem is closed, ready for the drive to be removed.

  6. I notice the LED colour is green, remove the drive, and take it to its off-site home.

If anything goes wrong, all my custom systemd units have, as a matter of course,

OnFailure=status-email-user@%n.service blinkstick-fail.service

Preparing a new backup disk

This is mostly just a standard dm-crypt/cryptsetup/LUKS encrypted device, on top of a standard partition on the underlying disk, with a normal filesystem sitting on top: Basically, the most common way to encrypt a drive in Linux. See places like the cryptsetup docs for how to set something like that up. The key things here are

  • set up a decryption key file as well as (or instead of) a passphrase and store that somewhere on the filesystem of the NAS
  • back up the LUKS header, as the cryptsetup documentation stresses you should
  • make a note of the underlying partition UUID: it's needed for the WantedBy line in the backup service file. (look in /dev/disk/by-uuid before and after inserting it and see what was added)
  • label the filesystem on top of the encrypted device for convenience
  • set up a /etc/crypttab line with all the info needed to decrypt
  • set up a /etc/fstab line with all the info needed to mount (yes, really; see "Issues" below). A sketch of both lines follows this list.
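For reference, those two lines might look something like this (the UUID, key file location and filesystem type are placeholders, not my real values):

# /etc/crypttab: name, underlying partition, key file, options
extbackup UUID=aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa /etc/keys/extbackup.key luks,noauto

# /etc/fstab: mount the decrypted mapper device, never automatically at boot
/dev/mapper/extbackup /extbackup ext4 noauto,nofail 0 0

The noauto options matter: the drive is usually absent, so nothing should try to unlock or mount it at boot time.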

The backup service

Here's the backup service unit definition in its entirety:

 [Unit]
 OnFailure=status-email-user@%n.service blinkstick-fail.service
 Requires=systemd-cryptsetup@extbackup.service backup.mount
 After=systemd-cryptsetup@extbackup.service backup.mount

 [Service]
 Type=oneshot
 ExecStart=/usr/local/bin/blinkstick --index 1 --limit 10 --set-color 33c280
 ExecStart=/bin/mount /extbackup
 ExecStart=/home/jon/bin/phobos-backup-monthly
 ExecStop=/bin/umount /extbackup
 ExecStop=/usr/local/bin/blinkstick --index 1 --limit 10 --set-color green

 [Install]
 WantedBy=dev-disk-by\x2duuid-aaaaaaaa\x2daaaa\x2daaaa\x2daaaa\x2daaaaaaaaaaaa.device

The dashes in the UUID in WantedBy= need to be encoded as \x2d and then the slashes from the path bit as dashes. Using dashes to encode slashes is possibly the single most frustrating systemd design decision.
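Rather than working the encoding out by hand, systemd-escape can generate the unit name (a sketch; substitute the real UUID):

systemd-escape --path --suffix=device /dev/disk/by-uuid/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa

which prints dev-disk-by\x2duuid-aaaaaaaa\x2daaaa\x2daaaa\x2daaaa\x2daaaaaaaaaaaa.device, ready to paste into WantedBy=.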

Issues

Sadly (as detailed in Blinkenlights, part 2) there are some frustrating limitations2 with trying to handle the mount (and unmount) of the filesystem in systemd-land, so instead, it's done using the traditional mount, umount and fstab.

If you can point out any improvements to this approach, please let me know!


  1. I customized mine a while ago by copying the generated service file to a static file, but nowadays I think you could do systemctl edit systemd-cryptsetup@extbackup.service to add the StopWhenUnneeded to an override file and not need the rest.
  2. or at least were. It's been a while since I revisited this part.

12 November, 2021 10:18PM

hackergotchi for Adnan Hodzic

Adnan Hodzic

wp-k8s: WordPress on privately hosted Kubernetes cluster (Raspberry Pi 4 + Synology)

Blog post you’re reading right now is privately hosted on Raspberry PI 4 Kubernetes cluster with its data coming from NFS share and MariaDB on...

The post wp-k8s: WordPress on privately hosted Kubernetes cluster (Raspberry Pi 4 + Synology) appeared first on FoolControl: Phear the penguin.

12 November, 2021 11:09AM by Adnan Hodzic

Reproducible Builds (diffoscope)

diffoscope 192 released

The diffoscope maintainers are pleased to announce the release of diffoscope version 192. This version includes the following changes:

* Update .epub test methodology after improving XML file parsing.

You can find out more by visiting the project homepage.

12 November, 2021 12:00AM

diffoscope 191 released

The diffoscope maintainers are pleased to announce the release of diffoscope version 191. This version includes the following changes:

[ Chris Lamb ]
* Detect XML files as XML files if either file(1) claims if they are XML
  files, or if they are named .xml.
  (Closes: #999438, reproducible-builds/diffoscope#287)
* Don't reject Debian .changes files if they contain non-printable
  characters. (Closes: reproducible-builds/diffoscope#286)
* Continue loading a .changes file even if the referenced files inside it do
  not exist, but include a comment in the diff as a result.
* Log the reason if we cannot load a Debian .changes file.

[ Zbigniew Jędrzejewski-Szmek ]
* Fix inverted logic in the assert_diff_startswith() utility.

You can find out more by visiting the project homepage.

12 November, 2021 12:00AM

Aloïs Micard

Laravel: beware of $touches

I have been using Laravel professionally for almost a year, and I must say: I’m very impressed with the framework. Everything runs smoothly, there’s a feature for (almost) everything you can think of, so you (almost) never need to reinvent the wheel. This is very advantageous since you can focus on building your product feature by feature and spend less time working on technical stuff that is less valuable to the business. Everything is fine… until it’s not.

12 November, 2021 12:00AM

November 10, 2021

hackergotchi for Jonathan Dowland

Jonathan Dowland

LEGO Princess Castle-books

The set

My eldest daughter and I visited a LEGO shop recently and I wanted to buy her a gift. The catch was that we were going to be flying on an airplane the next day, so I wanted to find something with the lowest risk of losing parts on the plane.

We settled on Ariel, Belle, Cinderella and Tiana's Storybook Adventures, which had a number of things going for it: It was reasonably priced at under £20, for the size of the set; it included four human minifigs (albeit in a sub-minifig size, some kind of munchkin size, but that did not seem to matter) and an assortment of animal accompaniments; but mostly, it folded up into a self-contained mock fairytale "book", and opened up into an enclosed "tray" play area, minimising the risk of losses on the flight.

The set in its resting state

Lego have done a few of these styles of sets, all Disney princess themed, and it looks like they have a few more on their product roadmap. The newer ones incorporate a locking mechanism with a cute Lego key. I love the concept and think it should be extended to other themes/properties. I can imagine a Lego Star Wars-themed version with a little Death Star trench in the middle, or even an original IP like Classic Space, or Medieval.

Exploring a DIY Lego book frame

I really liked the Book device which reminded me of hollow books as a child. The cover and spine pieces are bespoke Lego bricks made for purpose, but I thought you could create something similar with generic parts. Holly and I had a go at the concept with what bricks we had to hand. It's definitely viable (and you could do a lot better with a wider selection of bricks / more skilled builders) and it will be fun to pick something to try and build on the spine.

10 November, 2021 09:48PM

hackergotchi for Neil Williams

Neil Williams

LetsEncrypt with Apache, Gunicorn and Debian Bullseye

This took me far too long to identify and debug, so I'm going to write it up here for my own reference and to possibly help others.

Background

Upgrading an old codebase from Python2 on Buster to Python3 ready for Bullseye and from Django1 to Django2 (prepared for Django3). Everything is fine at this stage - the Django test server is happy with HTTP and it gives enough support to do the actual code changes to get to Python3. All well and good so far. The main purpose of this particular code was to support payments, so a chunk of the testing cannot be done without HTTPS, which is where things got awkward.

This particular service needs HTTPS using LetsEncrypt and Apache2. To support Django, I typically use Gunicorn.

All of this works with HTTP. Moving to HTTPS was easy to test using the default-ssl virtual host that comes with Apache2 in Debian. It's a static page and it worked well with https. The problems all start when trying to use this known-working HTTPS config with the other Apache virtual host to add support for the gunicorn proxy.

Apache reverse proxy AH00898 – Error during SSL Handshake with remote server

Investigating

Now that I know why this happened, it's easier to see what was happening. At the time, I was swamped in a plethora of options and permutations between the Django HTTPS options and the Apache VirtualHost SSL and proxy commands. Going through all of those took up huge amounts of time, all in the wrong area.

In previous configurations using packages in Buster, gunicorn could simply run on http://localhost:8000 and Apache would proxy that as https.

In the versions in Bullseye, this no longer works: it is the handover from HTTPS in Apache to HTTP in the proxy that fails.

Apache is using HTTPS because the LetsEncrypt certificates, created using dehydrated, are specified in the VirtualHost configuration. To fix the handshake error, the proxy server needs to know about the certificates created by dehydrated as well.

Gunicorn needs the certificates

The clue is in the gunicorn help:

--keyfile FILE        SSL key file [None]
--certfile FILE       SSL certificate file [None]

The final part of the puzzle is that the certificates created by dehydrated are in a private location:

drwx------ 2 root root /var/lib/dehydrated/certs/

To test gunicorn, this will mean using sudo but that's just a step towards running gunicorn as a systemd service (when access to the certs will not be a problem).
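For a quick manual test, something like the following works (the site module name and certificate paths mirror the service definition below):

sudo gunicorn3 site.wsgi --bind 127.0.0.1:8000 \
  --certfile /var/lib/dehydrated/certs/site/cert.pem \
  --keyfile /var/lib/dehydrated/certs/site/privkey.pem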

Starting gunicorn using these options shows the proxy now being available at https://localhost:8000 which is a subtle but very important change.

Environment=LOGLEVEL=DEBUG WORKERS=4 LOGFILE=/var/log/gunicorn/site.log
ExecStart=/usr/bin/gunicorn3 site.wsgi --log-level $LOGLEVEL --log-file $LOGFILE --workers $WORKERS \
--certfile /var/lib/dehydrated/certs/site/cert.pem \
--keyfile /var/lib/dehydrated/certs/site/privkey.pem

The specified locations are symbolic links created by dehydrated to cope with regular updates of the certificates using cron.

10 November, 2021 04:03PM by Neil Williams

Craig Small

Changing Grafana Legends

I’m not sure if I just can’t search Google properly, or if this really is just not written down much, but I have had problems with Grafana Legends (I would call them the series labels). The issue is that Grafana queries Prometheus for a time series and you want to display multiple lines, but the time-series labels you get are just not quite right.

A simple example: you might be using the blackbox exporter to monitor an external TCP port, and you would just like to display the port number on its own. The default output would look like this:

probe_duration_seconds{instance="example.net:5222",job="blackbox",module="xmpp_banner"} = 0.01
probe_duration_seconds{instance="example.net:5269",job="blackbox",module="xmpp_banner"} = 0.01

I can graph the number of seconds that it takes to probe the 5222 and 5269 TCP ports, but my graph legend is going to have the hostname, making it cluttered. I just want the legend to be the port numbers on Grafana.

The answer is to use a Prometheus function called label_replace that takes an existing label, applies a regular expression, then puts the result into another label. That’s right, regular expressions, and if you get them wrong then the label just doesn’t appear.

Perl REGEX Problems courtesy of XKCD

The label_replace documentation is a bit terse, and in my opinion, the order of parameters is messed up, but after a few goes I had what I needed:

label_replace(probe_duration_seconds{module="xmpp_banner"}, "port", "$1", "instance", ".*:(.*)")

probe_duration_seconds{instance="example.net:5222",job="blackbox",module="xmpp_banner",port="5222"}	0.001
probe_duration_seconds{instance="example.net:5269",job="blackbox",module="xmpp_banner",port="5269"}	0.002

The response now has a new label (or field if you like) called port. So what does this function do to our data coming from probe_duration_seconds? The function format is:

label_replace(value, dst_label, replacement, src_label, regex)

So the function does the following:

  1. Evaluate value, which is generally some sort of query such as probe_duration_seconds
  2. Find the required source label src_label, in this example is instance, in this case the values are example.net:5222 and example.net:5269
  3. Apply the regular expression regex, for us it’s “.*:(.*)”. That says: skip everything before “:”, then capture/store everything past “:”. The brackets mean copy what is after the colon and put it in match #1
  4. Make a new label specified in dst_label, for us this is port
  5. Whatever is in replacement goes into dst_label. For this example it is “$1”, which means match #1 from our regular expression ends up in the label called port.

In short, the function captures everything after the colon in the instance label and puts that into a new label called port. It does this for each value that is returned into the first parameter.
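If the regular expression part feels opaque, the same capture is easy to play with outside Prometheus. Here is a quick Python check of the pattern, just for intuition (Prometheus uses the RE2 engine, but this particular expression behaves the same way, and label_replace anchors the regex to the whole label value, hence fullmatch):

import re

instance = "example.net:5222"
match = re.fullmatch(r".*:(.*)", instance)
print(match.group(1))  # prints 5222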

This means I can use {{port}} in my Grafana graph legend and it will show 5222 or 5269 respectively. I have made the legend “TCP {{port}}” to give the result below, but I could have used {{port}} alone in the Grafana legend and made the replacement “TCP $1” in the label_replace function to get the same result.

Grafana console showing the use of the label_replace function

10 November, 2021 06:25AM by Dropbear Blog

November 09, 2021

hackergotchi for Benjamin Mako Hill

Benjamin Mako Hill

The Hidden Costs of Requiring Accounts

Should online communities require people to create accounts before participating?

This question has been a source of disagreement among people who start or manage online communities for decades. Requiring accounts makes some sense since users contributing without accounts are a common source of vandalism, harassment, and low quality content. In theory, creating an account can deter these kinds of attacks while still making it pretty quick and easy for newcomers to join. Also, an account requirement seems unlikely to affect contributors who already have accounts and are typically the source of most valuable contributions. Creating accounts might even help community members build deeper relationships and commitments to the group in ways that lead them to stick around longer and contribute more.

In a new paper published in Communication Research, I worked with Aaron Shaw to provide an answer. We analyze data from “natural experiments” that occurred when 136 wikis on Fandom.com started requiring user accounts. Although we find strong evidence that the account requirements deterred low quality contributions, this came at a substantial (and usually hidden) cost: a much larger decrease in high quality contributions. Surprisingly, the cost includes “lost” contributions from community members who had accounts already, but whose activity appears to have been catalyzed by the (often low quality) contributions from those without accounts.


A version of this post was first posted on the Community Data Science blog.

The full citation for the paper is: Hill, Benjamin Mako, and Aaron Shaw. 2020. “The Hidden Costs of Requiring Accounts: Quasi-Experimental Evidence from Peer Production.” Communication Research, 48 (6): 771–95. https://doi.org/10.1177/0093650220910345.

If you do not have access to the paywalled journal, please check out this pre-print or get in touch with us. We have also released replication materials for the paper, including all the data and code used to conduct the analysis and compile the paper itself.

09 November, 2021 07:55PM by Benjamin Mako Hill

hackergotchi for Joachim Breitner

Joachim Breitner

How to audit an Internet Computer canister

I was recently called upon by Origyn to audit the source code of some of their Internet Computer canisters (“canisters” are services or smart contracts on the Internet Computer), which were written in the Motoko programming language. Both the application model of the Internet Computer as well as Motoko bring with them their own particular pitfalls and possible sources for bugs. So given that I was involved in the creation of both, they reached out to me.

In the course of that audit work I collected a list of things to watch out for, and general advice around them. Origyn generously allowed me to share that list here, in the hope that it will be helpful to the wider community.

Inter-canister calls

The Internet Computer system provides inter-canister communication that follows the actor model: Inter-canister calls are implemented via two asynchronous messages, one to initiate the call, and one to return the response. Canisters process messages atomically (and roll back upon certain error conditions), but not complete calls. This makes programming with inter-canister calls error-prone. Possible common sources for bugs, vulnerabilities or simply unexpected behavior are:

  • Reading global state before issuing an inter-canister call, and assuming it    to still hold when the call comes back.

  • Changing global state before issuing an inter-canister call, changing it again    in the response handler, but assuming nothing else changes the state in    between (reentrancy).

  • Changing global state before issuing an inter-canister call, and not    handling failures correctly, e.g. when the code handling the callback rolls    backs.

If you find such patterns in your code, you should analyze whether a malicious party can trigger them, and assess the severity of the effect.

These issues apply to all canisters, and are not Motoko-specific.

Rollbacks

Even in the absence of inter-canister calls the behavior of rollbacks can be surprising. In particular, rejecting (i.e. throw) does not roll back state changes done before, while trapping (e.g. Debug.trap, assert …, out of cycle conditions) does.

Therefore, one should check all public update call entry points for unwanted state changes or unwanted rollbacks. In particular, look for methods (or rather, messages, i.e. the code between commit points) where a state change is followed by a throw.

These issues apply to all canisters, and are not Motoko-specific, although other CDKs may not turn exceptions into rejects (which don’t roll back).

Talking to malicious canisters

Talking to untrustworthy canisters can be risky, for the following (likely incomplete) reasons:

  • The other canister can withhold a response. Although the bidirectional   messaging paradigm of the Internet Computer was designed to guarantee a   response eventually, the other party can busy-loop for as long as they are   willing to pay for before responding. Worse, there are ways to deadlock a   canister.

  • The other canister can respond with invalidly encoded Candid. This will cause   a Motoko-implemented canister to trap in the reply handler, with no easy way   to recover. Other CDKs may give you better ways to handle invalid Candid, but even then you will have to worry about Candid cycle bombs that will cause your reply handler to trap. 

Many canisters do not even do inter-canister calls, or only call other trustworthy canisters. For the others, the impact of this needs to be carefully assessed.

Canister upgrade: overview

For most services it is crucial that canisters can be upgraded reliably. This can be broken down into the following aspects:

  1. Can the canister be upgraded at all?
  2. Will the canister upgrade retain all data?
  3. Can the canister be upgraded promptly?
  4. Is there a recovery plan for when upgrading is not possible?

Canister upgradeability

A canister that traps, for whatever reason, in its canister_preupgrade system method is no longer upgradeable. This is a major risk. The canister_preupgrade method of a Motoko canister consists of the developer-written code in any system func preupgrade() block, followed by the system-generated code that serializes the content of any stable var into a binary format, and then copies that to stable memory.

Since the Motoko-internal serialization code will first serialize into a scratch space in the main heap, and then copy that to stable memory, canisters with more than 2GB of live data will likely be unupgradeable. But this is unlikely the first limit:

The system imposes an instruction limit on upgrading a canister (spanning both canister_preupgrade and canister_postupgrade). This limit is a subnet configuration value, separate from (and likely higher than) the normal per-message limit, and not easily determined. If the canister’s live data becomes too large to be serialized within this limit, the canister becomes non-upgradeable.

This risk cannot be eliminated completely, as long as Motoko and Stable Variables are used. It can be mitigated by appropriate load testing:

Install a canister, fill it up with live data, and exercise the upgrade. If this succeeds with a live data set exceeding the expected amount of data by a margin, this risk is probably acceptable. Bonus points for adding functionality that will prevent the canister’s live data from growing above a certain size.

If this testing is to be done on a local replica, extra care needs to be taken to make sure the local replica actually performs instruction counting and has the same resource limits as the production subnet.

An alternative mitigation is to avoid canister_pre_upgrade as much as possible. This means no use of stable var (or restricted to small, fixed-size configuration data). All other data could be

  • mirrored off the canister (possibly off chain), and manually re-hydrated after an upgrade.
  • stored in stable memory manually, during each update call, using the ExperimentalStableMemory API. While this matches what high-assurance Rust canisters (e.g. the Internet Identity) do, this requires manual binary encoding of the data, and is marked experimental, so I cannot recommend this at the moment.
  • not put into a Motoko canister until Motoko has a scalable solution for stable variable (for example keeping them in stable memory permanently, with smart caching in main memory, and thus obliterating the need for pre-upgrade code.)

Data retention on upgrades

Obviously, all live data ought to be retained during upgrades. Motoko automatically ensures this for stable var data. But often canisters want to work with their data in a different format (e.g. in objects that are not shared and thus cannot be put in stable vars, such as HashMap or Buffer objects), and thus may follow the following idiom:

stable var fooStable = …;
var foo = fooFromStable(fooStable);
system func preupgrade() { fooStable := fooToStable(foo); };
system func postupgrade() { fooStable := (empty); };

In this case, it is important to check that

  • All non-stable global vars, or global lets with mutable values, have a stable companion.
  • The assignments to foo and fooStable are not forgotten.
  • The fooToStable and fooFromStable functions are inverses of each other (together they form a bijection).

An example would be HashMaps stored as arrays via Iter.toArray(….entries()) and HashMap.fromIter(….vals()).

It is worth pointing out that a code review will only look at a single version of the code, and cannot check whether code changes will preserve data on upgrade. This can easily go wrong if the names and types of stable variables are changed in an incompatible way. The upgrade may fail loudly in these cases, but in bad cases the upgrade may even succeed, losing data along the way. This risk needs to be mitigated by thorough testing, and possibly backups (see below).

Prompt upgrades

Motoko and Rust canisters cannot be safely upgraded when they are still waiting for responses to inter-canister calls (the callback would eventually reach the new instance, and because of infelicities of the IC’s System API, could possibly call arbitrary internal functions). Therefore, the canister needs to be stopped before upgrading, and started again. If the inter-canister calls take a long time, this means that upgrading may take a long time, which may be undesirable. Again, this risk is reduced if all calls are made to trustworthy canisters, and elevated when possibly untrustworthy canisters are called, directly or indirectly.

Backup and recovery

Because of the above risk around upgrades it is advisable to have a disaster recovery strategy. This could involve off-chain backups of all relevant data, so that it is possible to reinstall (not upgrade) the canister and re-upload all data.

Note that reinstall has the same issue as upgrade described above in “prompt upgrades”: It ought to be stopped first to be safe.

Note that the instruction limit for messages, as well as the message size limit, limit the amount of data returned. If the canister needs to hold more data than that, the backup query method might have to return chunks or deltas, with all the extra complexity that entails, e.g. state changes between downloading chunks.

If large data load testing is performed (as I recommend anyway to test upgradeability), one can test whether the backup query method works within the resource limits.

Time is not strictly monotonic

The timestamps for “current time” that the Internet Computer provides to its canisters are guaranteed to be monotonic, but not strictly monotonic. The same value can be returned for multiple messages, as long as they are processed in the same block. These timestamps should therefore not be used to detect “happens-before” relations.

Instead of using and comparing time stamps to check whether Y has been performed after X happened last, introduce an explicit var y_done : Bool state, which is set to False by X and then to True by Y. When things become more complex, it will be easier to model that state via an enumeration with speaking tag names, and update this “state machine” along the way.

Another solution to this problem is to introduce a var v : Nat counter that you bump in every update method, and after each await. Now v is your canister’s state counter, and can be used like a timestamp in many ways.

While we are talking about time: The system time (typically) changes across an await. So if you do let now = Time.now() and then await, the value in now may no longer be what you want.

Wrapping arithmetic

The Nat64 data type, and the other fixed-width numeric types, provide opt-in wrapping arithmetic (e.g. +%, fromIntWrap). Unless explicitly required by the current application, this should be avoided, as usually an overly large or negative value is a serious, unrecoverable logic error, and trapping is the best one can do.

Cycle balance drain attacks

Because of the IC’s “canister pays” model, all canisters are prone to DoS attacks by draining their cycle balance, and this risk needs to be taken into account.

The most elementary mitigation strategy is to monitor the cycle balance of canisters and keep it far from the (configurable) freezing threshold.

On the raw IC-level, further mitigation strategies are possible:

  • If all update calls are authenticated, perform this authentication as quickly as possible, possibly before decoding the caller’s argument. This way, a cycle drain attack by an unauthenticated attacker is less effective (but still possible).

  • Additionally, implementing the canister_inspect_message system method allows the above checks to be performed before the message even is accepted by the Internet Computer. But it does not defend against inter-canister messages and is therefore not a complete solution.

  • If an attack from an authenticated user (e.g. a stakeholder) is to be expected, the above methods are not effective, and an effective defense might require relatively involved additional program logic (e.g. per-caller statistics) to detect such an attack, and react (e.g. rate-limiting).

  • Such defenses are pointless if there is only a single method where they do not apply (e.g. an unauthenticated user registration method). If the application is inherently attackable this way, it is not worth the bother to raise defenses for other methods.

   Related: a justification of why the Internet Identity does not use canister_inspect_message.

A Motoko-implemented canister currently cannot perform most of these defenses: Argument decoding happens unconditionally before any user code that may reject a message based on the caller, and canister_inspect_message is not supported. Furthermore, Candid decoding is not very cycle defensive, and one should assume that it is possible to construct Candid messages that require many instructions to decode, even for “simple” argument type signatures.

The conclusion for the audited canisters is to rely on monitoring to keep the cycle balance up, even during an attack, if the expense can be borne, and maybe pray for IC-level DoS protections to kick in.

Large data attacks

Another DoS attack vector exists if public methods allow untrustworthy users to send data of unlimited size that is persisted in the canister memory. Because of the translation of async-await code into multiple message handlers, this applies not only to data that is obviously stored in global state, but also to local data that is live across an await point.

The effectiveness of such attacks is limited by the Internet Computer’s message size limit, which is in the order of a few megabytes, but many of those also add up.

The problem becomes much worse if a method has an argument type that allows a Candid space bomb: It is possible to encode very large vectors with all values null in Candid, so if any method has an argument of type [Null] or [?t], a small message will expand to a large value in the Motoko heap.

Other types to watch out for:

  • Nat and Int: These are unbounded numbers, and thus can be arbitrarily large. The Motoko representation will however not be much larger than the Candid encoding (so this does not qualify as a space bomb).

   It is still advisable to check if the number is reasonable in size before storing it or doing an await. For example, when it denotes an index in an array, throw early if it exceeds the size of the array; if it denotes a token amount to transfer, check it against the available balance, if it denotes time, check it against reasonable bounds.

  • Principal: A Principal is effectively a Blob. The Interface specification says that principals are at most 29 bytes in length, but the Motoko Candid decoder does not check that currently (fixed in the next version of Motoko). Until then, a Principal passed as an argument can be large (the principal in msg.caller is system-provided and thus safe). If you cannot wait for the fix to reach you, manually check the size of the principal (via Principal.toBlob) before doing the await.

Shadowing of msg or caller

Don’t use the same name for the “message context” of the enclosing actor and the methods of the canister: It is dangerous to write shared(msg) actor, because now msg is in scope across all public methods. As long as these also use public shared(msg) func …, and thus shadow the outer msg, it is correct, but if one accidentally omits or mis-types the msg, no compiler error would occur, but suddenly msg.caller would now be the original controller, likely defeating an important authorization step.

Instead, write shared(init_msg) actor or shared({caller = controller}) actor to avoid using msg.

Conclusion

If you write a “serious” canister, whether in Motoko or not, it is worth going through the code and watching out for these patterns. Or better, have someone else review your code, as it may be hard to spot issues in your own code.

Unfortunately, such a list is never complete, and there are surely more ways to screw up your code – in addition to all the non-IC-specific ways in which code can be wrong. Still, things get done out there, so best of luck!

09 November, 2021 05:34PM by Joachim Breitner (mail@joachim-breitner.de)

Aloïs Micard

Laravel dynamic SMTP mail configuration

Hello friend… It has been a while. I have been so busy lately with work, open source and life that I didn’t find the energy to write a blog post. Despite having some good ideas, I wasn’t really in the mood. Hopefully, I now have the energy and the subject to make a good blog post: let’s talk about Laravel and emails! 1. Laravel and SMTP 1.1. Configuration Laravel SMTP Mail support is truly awesome and works out-of-the-box without requiring anything more than a few env variables:

09 November, 2021 12:00AM

November 08, 2021

Dima Kogan

mrcal 2.0: triangulation and stereo

mrcal is my big toolkit for geometric computer vision: making models (camera calibration) and using models (mapping, ranging, etc).

Since the release of mrcal 1.0 back in February I've been busy using the tools in the field, fixing things and improving things. Today I'm happy to finally be able to announce the release of mrcal 2.0.

A big part of this release is maintenance and cleanup that resulted from me heavily using the tools over the course of this past year, and improving whatever was bugging me. The most notable result of that effort is that splined models are no longer "experimental". They work well and they're awesome. Go try them.

And there are a number of new features, most notably nice dense stereo support and nice sparse triangulation support (with uncertainty propagation!). These are awesome. Go try them.

As before, the tour of mrcal provides a good overview of the capabilities of the toolkit, and is a good place to start reading the documentation. Reading these docs would be very illuminating for anybody who calibrates cameras, even those who have no intention of actually using the mrcal tools.

Let me know if you try it out!

The list of the most notable improvements, from the release notes:

08 November, 2021 10:40PM by Dima Kogan

Enrico Zini

An educational debugging session

This morning we realised that a test case failed on Fedora 34 only (the link is in Italian), and we set about debugging.

The initial analysis

This is the initial reproducer:

$ PROJ_DEBUG=3 python setup.py test
test_recipe (tests.test_litota3.TestLITOTA3NordArkimetIFS) ... pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: Open of /lib64/../share/proj/proj.db failed
pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: no database context specified
Cannot instantiate source_crs
EXCEPTION in py_coast(): ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
ERROR

Note that opening /lib64/../share/proj/proj.db sometimes succeeds, sometimes fails. It's some kind of Schrödinger path, which works or not depending on how you observe it:

# ls -lad /lib64
lrwxrwxrwx 1 1000 1000 9 Jan 26  2021 /lib64 -> usr/lib64

$ ls -la /lib64/../share/proj/proj.db
-rw-r--r-- 1 root root 8925184 Jan 28  2021 /lib64/../share/proj/proj.db

$ cd /lib64/../share/proj/

$ cd /lib64
$ cd ..
$ cd share
-bash: cd: share: No such file or directory

And indeed, stat(2) finds it, and sqlite doesn't (the file is a sqlite database):

$ stat /lib64/../share/proj/proj.db
  File: /lib64/../share/proj/proj.db
  Size: 8925184     Blocks: 17432      IO Block: 4096   regular file
Device: 33h/51d Inode: 56907       Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-11-08 14:09:12.334350779 +0100
Modify: 2021-01-28 05:38:11.000000000 +0100
Change: 2021-11-08 13:42:51.758874327 +0100
 Birth: 2021-11-08 13:42:51.710874051 +0100

$ sqlite3 /lib64/../share/proj/proj.db
Error: unable to open database "/lib64/../share/proj/proj.db": unable to open database file

A minimal reproducer

Later on we started stripping layers of code towards a minimal reproducer: here it is. It works or fails depending on whether proj is linked explicitly or only indirectly via MagPlus:

$ cat tc.cc
#include <magics/ProjP.h>

int main() {
    magics::ProjP p("EPSG:4326", "+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs");
    return 0;
}

$ g++ -o tc  tc.cc -I/usr/include/magics  -lMagPlus
$ ./tc
proj_create: Open of /lib64/../share/proj/proj.db failed
proj_create: no database context specified
terminate called after throwing an instance of 'magics::MagicsException'
  what():  ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
Aborted (core dumped)

$ g++ -o tc  tc.cc -I/usr/include/magics -lproj  -lMagPlus
$ ./tc

What is going on here?

One difference between the two is the path through which libproj.so gets loaded:

$ ldd ./tc | grep proj
    libproj.so.19 => /lib64/libproj.so.19 (0x00007fd4919fb000)
$ g++ -o tc  tc.cc -I/usr/include/magics   -lMagPlus
$ ldd ./tc | grep proj
    libproj.so.19 => /lib64/../lib64/libproj.so.19 (0x00007f6d1051b000)

Common sense screams that this should not matter, but we chased an intuition and found that one of the ways proj looks for its database is relative to its shared library.

Indeed, gdb in hand, we saw that proj’s dladdr call returns /lib64/../lib64/libproj.so.19.

From /lib64/../lib64/libproj.so.19, proj strips two path components from the end, presumably to go from something like /something/usr/lib/libproj.so to /something/usr.

So, dladdr returns /lib64/../lib64/libproj.so.19, which becomes /lib64/../, which becomes /lib64/../share/proj/proj.db, which exists on the file system and is used as a path to the database.

But depending how you look at it, that path might or might not be valid: it passes the stat(2) check that stops the lookup for candidate paths, but sqlite is unable to open it.

Why does the other path work?

By linking libproj.so in the other way, dladdr returns /lib64/libproj.so.19, which becomes /share/proj/proj.db, which doesn't exist, which triggers a fallback to a PROJ_LIB constant defined at compile time, which is a path that works no matter how you look at it.

Why that weird path with libMagPlus?

To complete the picture, we found that libMagPlus.so is packaged with an rpath set, which is known to cause trouble:

# readelf -d /usr/lib64/libMagPlus.so|grep rpath
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../lib64]

The workaround

We found that one can set PROJ_LIB in the environment to override the normal proj database lookup. Building on that, we came up with a simple way to override it on Fedora 34 only:

    if distro is not None and distro.linux_distribution()[:2] == ("Fedora", "34") and "PROJ_LIB" not in os.environ:
        self.env_overrides["PROJ_LIB"] = "/usr/share/proj/"

This has been a most edifying and educational debugging session, with only the necessary modicum of curses and swearwords. Working in a team of excellent people really helps.

08 November, 2021 02:20PM