Raspberry Pi 4B
Wi-Fi

Jim Carter, 2023-08-17

I have multiple issues with my Wi-Fi access point:

Since my Wi-Fi is used by guests with doubtful security, I want my Wi-Fi access point to have my firewall on it. That implies that it will run on one of my machines, not a commercial appliance. I've been doing this for several years, using hostapd by Jouni Malinen, currently version 2.10.
In the original configuration, the AP (and most other hosts) are at one end of my house. At the other end are two security cameras, which associate unreliably with the AP due to distance and intervening exterior walls. I'm using a commercial appliance, a TP-Link AC750 Wi-Fi Range Extender model RE220, but it has issues, and I would like to use my own machine for this also. But various problems have gotten in the way.
At rare intervals, since I've been using NICs with MediaTek chipsets (Terow Mediatek AC-1200Mbps and Alfa AWUS036ACM), the CPU (x86_64) mysteriously freezes, with complete silence in the log files. This kind of thing has been reported for decades; the earliest report I saw was from 2007; and the association with MediaTek is likely just coincidence. Others blame an unfixed or unfixable bug in Linux memory management. But I moved the AP off my main router (Jacinth) onto a Raspberry Pi 3B (Holly, aarch64/ARM). This got rid of the freezes for several years, but now they're back. One person solved a similar issue by upgrading to a RPi 4B, and I've been looking for an excuse to do that upgrade. The new RPi 4B is called Beaver and is the subject of this hardware review.

Here is a list of the NICs I'm working with:

Nbr	Brand	Firmware	MAC Address
0	Terow AC-1200Mbps	Mediatek	00:13:ef:5f:0c:3c
1	Alfa AWUS036ACM	Mediatek	00:c0:ca:b0:60:4b
2	Cudy WU1400AC	Realtek	b4:4b:d6:27:de:88
3	Alfa-rt AC1200	Realtek	00:c0:ca:a8:4c:7e
4	Intel 3168NGW	Intel	60:f6:77:76:21:63
6	TP-Link TL-WN722N	Atheros	90:f6:52:08:c8:4b
7	Za-pai	Ralink	00:e1:80:67:84:34

1-letter codes for AP hosts, NICs and places:

Hosts:
- B = Beaver (Raspberry Pi 4B)
- H = Holly (Raspberry Pi 3B)
- P = Piki (Raspberry Pi 3B)
NICs:
- A = Alfa AWUS036ACM
- T = Terow AC-1200
- Z = Za-Pai (vendor name is not obvious, made in China)
Places:
- C = Compact Office where Jacinth lives, west end of living room
- J = Jim's desk, west end of office
- Q = Top of curio cabinet, about 2.5 meters up, east end of living room, close to the center of the house

In prior unsuccessful attempts to make a range extender, I got into packet storms that caused uninvolved Wi-Fi packets to be collided with or otherwise forced off the aether. The code word packetvore denotes an episode of very high loss of Wi-Fi packets.

First Successful Deployment

Summary:

In all cases the uplink was IEEE 802.3 wired Ethernet.
The former configuration had the TP-Link range extender in the laundry area.
It's impractical to get wired Ethernet to the laundry area, but I moved Piki + Za-Pai to the top of the curio cabinet (P/Z/Q) where I can run Ethernet (ugly). Its SSID is CouchNet-EXT (imitating the TP-Link range extender) and it serves only Ring camera 2, acceptably. It was not the focus of these tests.
The combinations under test used the SSID of CouchNet. Only one was running at a time.
Holly + Terow AC1200 on Jim's desk (H/T/J): 3 trials, 2 near perfect, 1 packetvore.
Beaver + Alfa AWUS036ACM on Compact Office (B/A/C): 2 trials, one set of major packetvores and another less aggressive.
Holly + Alfa AWUS036ACM on Jim's desk (H/A/J): 2 near perfect trials, one was 16hr 20min.
Beaver + Terow AC1200 on Compact Office (B/T/C): 1 test of 4hr, one pair of major glitches which were brief, not the usual packetvore.

Discussion: All combinations of host and NIC are capable of running for long periods with high quality Wi-Fi. All but one combination suffered packetvore attacks or major glitches. It is definitely not true that one or the other NIC fails consistently (both seem similar), and the same for the two hosts. Patterns of packet loss vary from day to day, or even hour to hour. I would be very inclined to attribute the problems to malign outside influences, except that neighbors' beacons are all under -75dBm, way below local levels, and it's plausible that the neighbors' stations have similar power, far from enough to disrupt local communication.

Conclusion: The H/A/J combination has been trouble-free for a week, cross fingers. By abandoning the range extender configuration and using wired Ethernet for all APs, I think I'm now on the right track.

Second Improved Deployment

After a week of hiatus I'm going to violate the policy "if it isn't broken, don't fix it". The initial status is:
- Geneve: is disabled and inactive on all AP hosts.
- Uplink: for all, IEEE 802.3 wired Ethernet (en0) is used. Onboard Wi-Fi (rad9) is not associated.
- Beaver: hostapd.J is inactive, SSID would have been CouchNet, NIC is Terow AC-1200 but, as is common, it has gone catatonic.
- Holly: hostapd.J is active, SSID is CouchNet, NIC is Alfa AWUS036ACM on rad1 (H/A/J), 10 stations are connected.
- Piki: hostapd.J is active, SSID is CouchNet-EXT, NIC is Za-Pai on rad7 (P/Z/Q), 1 station (ringc2) is connected.
The planned change is to have all 3 APs active, with SSID CouchNet, and with Beaver and the Alfa atop the curio cabinet (B/A/Q). A future improvement will be to buy 2 more Alfa AWUS036ACM's, replacing the Terow AC-1200 (unreliable) and the Za-Pai (it's za-pai). Another improvement is a second Raspberry Pi 4B serving Jim's desk; see this review's front page for good words about the RPi 4B in the desktop replacement role.

I want to do a signal strength survey before and after moving the APs to their new locations. Former locations of the APs and NICs:

Beaver + Terow AC-1200: Jacinth's cabinet (B/T/C)
Holly + Alfa AWUS036ACM: Office (H/A/J)
Piki + Za-Pai: Atop curio cabinet (P/Z/Q)

Planned future locations and NIC assignments:

Beaver + Alfa AWUS036ACM: Atop curio cabinet (B/A/Q).
Holly + Za-Pai: Office, Jim's desk (H/Z/J)
Piki + Terow AC-1200: Jacinth's cabinet (P/T/C)

These will be the sites to measure at. Ring cameras report their own signal strength; at other sites I'm using Xena (laptop) to measure. The procedure on Xena will be to disconnect Wi-Fi, wait 10sec, reconnect (giving it the chance to switch to the strongest AP), then run iwconfig $NIC 3 times 15sec apart, and report the average signal level, always negative, in dBm. It also reports which AP it is associated with.

The Couch, near Jacinth's cabinet
Office, Alice's desk, east side
Dining room, south side
Breakfast room, south side
Laundry room
Garage, on workbench
Jim's bedroom (2nd floor)
Alice's bedroom (2nd floor)
Master bathroom (2nd floor)
RingC1, patio outside living room
RingC2, breakfast room through exterior wall
RingC3, front door outside

Data to record at each site:

Signal strength reported by laptop or Ring camera
Which AP is being used
Signal strength reported by Selen's WiFi Analyzer (Android) for all three NICs
On the laptop, iwconfig $NIC reports the BSSID, i.e. the MAC address of the AP in use, and its signal strength. Ring cameras report their own signal strength. On the AP, hostapd_cli -i $NIC all_sta lists all associated stations (reformat it neatly and look for the Ring cameras).
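The laptop part of the survey procedure (three iwconfig readings 15 sec apart, averaged) could be scripted roughly as follows. This is only a sketch: the regular expressions assume the usual "Signal level=-NN dBm" and "Access Point: MAC" wording that iwconfig prints for many drivers, and the interface name and sample report are made up for illustration.

```python
import re
import statistics
import subprocess
import time

# Assumed iwconfig wording; some drivers print the fields differently.
SIGNAL_RE = re.compile(r"Signal level[=:]\s*(-?\d+)\s*dBm")
BSSID_RE = re.compile(r"Access Point:\s*((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})")


def parse_iwconfig(text):
    """Return (bssid, signal_dbm) from one iwconfig report; a missing field is None."""
    ap = BSSID_RE.search(text)
    sig = SIGNAL_RE.search(text)
    return (ap.group(1).lower() if ap else None,
            int(sig.group(1)) if sig else None)


def average_signal(reports):
    """Mean signal level (dBm) over several reports, skipping unparsable ones."""
    levels = [s for _, s in map(parse_iwconfig, reports) if s is not None]
    return statistics.mean(levels) if levels else None


def survey(nic, count=3, gap_sec=15):
    """Run iwconfig `count` times, `gap_sec` apart; return (last BSSID, mean dBm)."""
    reports = []
    for i in range(count):
        if i:
            time.sleep(gap_sec)
        done = subprocess.run(["iwconfig", nic], capture_output=True,
                              text=True, check=True)
        reports.append(done.stdout)
    return parse_iwconfig(reports[-1])[0], average_signal(reports)


# Illustrative report only (hypothetical interface name wlan0), not captured data.
SAMPLE = """wlan0     IEEE 802.11  ESSID:"CouchNet"
          Mode:Managed  Frequency:2.462 GHz  Access Point: 00:C0:CA:B0:60:4B
          Link Quality=62/70  Signal level=-48 dBm
"""
```

The BSSID can then be looked up in the NIC table to see which AP the laptop chose. Disconnecting and reconnecting Wi-Fi beforehand is left to the operator.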

Site	Initial			Final
	Signal	AP Used	Scan	Signal	AP Used	Scan
Couch	-48	H/A/J	A=-44 Z=-57	-41	P/T/C	A=-44 T=-33 Z=-55
Office	-41	H/A/J	A=-34 Z=-67	-55	P/T/C	A=-47 T=-31 Z=-44
Dining Rm	-66	H/A/J	A=-57 Z=-61	-44	B/A/Q	A=-39 T=-58 Z=-71
Breakfast Rm	-71	H/A/J	A=-67 Z=-53	-54	B/A/Q	A=-56 T=-61 Z=-76
Laundry	-76	H/A/J	A=-68 Z=-71	-56	B/A/Q	A=-64 T=-62 Z=-83
Garage	-63	H/A/J	A=-60 Z=-80	-59	P/T/C	A=-63 T=-73 Z=-77
Jim's Bed	-56	H/A/J	A=-45 Z=-66	-61	P/T/C	A=-60 T=-62 Z=-68
Alice's Bed	-60	H/A/J	A=-55 Z=-64	-53	B/A/Q	A=-56 T=-56 Z=-73
Bathroom	-73	H/A/J	A=-71 Z=-61	-58	B/A/Q	A=-54 T=-44 Z=-82
RingC1	-59	H/A/J		-59	P/T/C
RingC2	-71	P/Z/Q		-59	B/A/Q
RingC3	-49	H/A/J		-52	P/T/C

Wi-Fi is working.
- All surveyed locations have acceptable signal strength ranging from -41 to -61 dBm. (-67 dBm is a recommended lower bound.)
- Networking is normal; whichever AP a station is connected to, it can connect to other hosts including offsite, and other hosts can connect to it.
- The roaming survey machine (Xena) was seen to switch APs with the loss of one ping packet during reauthentication.
- Setting up and testing IEEE 802.11i pre-authentication is for the future. This would do 802.1X authentication over the LAN via the AP being left, before the station switches to the new AP, so the switchover goes a lot quicker and fewer packets should be lost.
- The new configuration has been working for two weeks with no problems except that the Ethernet cable to the curio cabinet needs to be routed in a less ugly way.

The Packetvore is Back!

I received and installed the new Alfa AWUS036ACM NICs, replacing the Terow AC-1200 and the Za-Pai. No problems during or after installation, except…

Before and after installing the new Alfa NICs, I had a small number of packetvore attacks, i.e. episodes of high packet loss lasting at least several minutes. In some cases the loss is believed to be less than 100%, but in others I'm pretty sure it was 100%. When the station was induced to connect to a different AP, it communicated normally, and the original AP returned to normal function (but I don't know how long that took).

I'm trying to gather information about what's happening. The first job is to set up a test station connecting to each AP, and I'll use the AP hosts themselves: Beaver → Holly → Piki → Beaver, sending from their onboard NICs, all rad9. /etc/sysconfig/network/ifcfg-rad9 contains: (replacing value keywords with their numeric values)

STARTMODE='auto'
BOOTPROTO='static'
IPADDR='my fixed IPv4 adr/nbr of bits'
IPADDR_0='my fixed IPv6 adr/nbr of bits'
WIRELESS_ESSID='CouchNet'
WIRELESS_WPA_PSK='WouldntYouLikeToKnow'
WIRELESS_AP='MAC adr of AP NIC on the target'

Since a route with more prefix bits is preferred, a host (maximum length) route forces traffic for the target onto rad9. /etc/sysconfig/network/routes-rad9 contains:

# Dest			Via     NMask   Ifc     Options
Target Wi-Fi IPv4 adr/32	-       -       -
Target Wi-Fi IPv6 adr/128	-       -       -
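To illustrate why the host route wins, here is a minimal sketch of longest-prefix selection using Python's ipaddress module. The addresses are documentation-range placeholders and the three-entry table is an assumption for the example, not my real routing table; the kernel's actual lookup also considers metrics and policy rules.

```python
import ipaddress


def pick_route(dest, routes):
    """Return the (network, interface) pair with the longest prefix containing dest."""
    addr = ipaddress.ip_address(dest)
    matching = []
    for prefix, ifc in routes:
        net = ipaddress.ip_network(prefix)
        if net.version == addr.version and addr in net:
            matching.append((net, ifc))
    if not matching:
        return None
    return max(matching, key=lambda r: r[0].prefixlen)


# Hypothetical table: default route and LAN route on br0, host route on rad9.
ROUTES = [
    ("0.0.0.0/0", "br0"),
    ("192.0.2.0/24", "br0"),
    ("192.0.2.17/32", "rad9"),     # the target AP's Wi-Fi address
    ("2001:db8::/64", "br0"),
    ("2001:db8::17/128", "rad9"),
]
```

With this table, pick_route("192.0.2.17", ROUTES) selects rad9 while every other LAN address still goes out br0.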

No special routes on the target. The station

Range Extender (Failed)

A Wi-Fi range extender includes an access point, which other stations connect to, plus a client (station) that can forward traffic between the connecting stations and some other AP, usually the main Wi-Fi router. I wanted to junk the commercial range extender, TP-Link AC750 Wi-Fi Range Extender model RE220, and replace it with my own Raspberry Pi 3B. But packet loops (code name: packetvore) are easy to create and hard to diagnose and mitigate, and I eventually abandoned the range extender project, switching to standalone APs on the LAN via wired Ethernet.

So this is a long collection of notes about work on the range extender which happened before the two successful deployments described previously. It has not been edited very much for comprehensibility, except for a few section headings; it is preserved because methods and data may be useful in the future for something. People whose interest is in the working deployments can stop reading here.

Range Extender Basic Design

This design produces a working range extender, except for the minor detail of packetvore attacks at random intervals.

Each Raspberry Pi range extender has one external USB NIC to be the access point that stations connect to, and the uplink is its internal Wi-Fi, connected to an AP that's directly on the LAN.
The AP is in a bridge. For the LAN-resident APs the wired Ethernet NIC is also in this bridge, and the effect is as if the connecting stations were directly on the LAN. No routing is needed.
The firmware in the TP-Link range extender assigned IPs randomly from a separate address range, and it routed between that range and the LAN. Stations could originate connections to and beyond the LAN, but LAN clients could not figure out the randomly assigned IP addresses of the stations, so could not originate connections to them. This arrangement was one of the reasons I junked the TP-Link range extender.
For the RPi range extender, it would be really nice if the Wi-Fi uplink NIC could go in the bridge just like wired Ethernet. But there are corner cases where 802.11 Wi-Fi and bridging conflict, such as multicast, and the 802.11 driver maintainers got tired of nested kludgy workarounds, and introduced a control bit in the bridge driver so drivers like theirs could declare the interface unbridgeable. Sabotaging this bit is simple, but I got tired of maintaining a hacked driver that tainted the kernel, and I gave that up.
Instead, there's a Geneve tunnel (it's bidirectional) with one end in the range extender's bridge and the other in the bridge of the LAN-resident AP (referred to as the Geneve server), which puts the stations effectively on the LAN, same as with wired Ethernet. But…
Geneve bearer packets cannot successfully be borne by the Geneve tunnel (a chicken and egg issue). So I use policy routing to divert those bearer packets to the RPi's internal NIC, which passes them to the Geneve server. Being addressed to that host (rather than a generic LAN host), the bearer packets are swallowed by the Geneve bearer port, and the payload packets appear on its bridge. This actually works, usually.
Initially there is only one extender and one Geneve server, but as much as feasible, the setup program is designed to handle multiple extenders and servers.
Geneve tunnel endpoint IPs and channel ID number (24 bits) must be specified at creation. I'm using the IPv4 addresses of the bridges in the extender and the Geneve server. (IPv6 would also work.) The channel ID must be the same at both ends. I take the last octet of the IPs of the extender and server, and multiply the smaller by 0x100 and add the larger.
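The channel ID rule above can be written out as a small function. This is a sketch of the arithmetic only; the addresses are documentation-range placeholders, not my real bridge IPs.

```python
import ipaddress


def geneve_channel_id(extender_ip, server_ip):
    """Channel ID = (smaller last octet) * 0x100 + (larger last octet).

    Sorting the octets makes the result the same no matter which end
    computes it, and it never exceeds 0xFFFF, well within 24 bits.
    """
    octets = [int(ipaddress.IPv4Address(ip)) & 0xFF
              for ip in (extender_ip, server_ip)]
    low, high = sorted(octets)
    return low * 0x100 + high


# Hypothetical endpoints: extender at .33, Geneve server at .7.
EXAMPLE_ID = geneve_channel_id("192.0.2.33", "192.0.2.7")  # 7 * 256 + 33
```

One consequence of this scheme is that the ID is unique per pair of hosts only as long as all endpoints share one /24 (or otherwise have distinct last octets).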

Non-Beacon Mode

The first problem I encountered was, on NICs #2, 3 and 6, hostapd starts up, puts the NIC in the bridge, claims to have turned on AP mode, but WiFi Analyzer for Android doesn't report any beacons for that NIC (it has a debug SSID so is distinguishable), and you can't associate with it. Message: hostapd.J[13688]: rad0: AP-ENABLED. This is the normal message for successfully starting up hostapd. Google searches for my symptom reveal nothing.

Summary of iw list for various NICs as seen on Beaver:

Onboard NIC: This is phy0. Supported ciphers: WEP40/104, TKIP, CCMP-128, CMAC. Modes: managed, AP, etc. Bands: 2.4GHz, 5GHz. Valid interface combinations: 1 or 2 managed, or 1 managed + 1 AP, or just 1 AP, or various P2P (no IBSS). Must be on the same channel I think. Firmware is onboard (nothing was uploaded during driver init).
NIC #0 (Terow, Mediatek): Supported ciphers: WEP40/104, TKIP, CCMP-128/256, GCMP-128/256, CMAC, CMAC-256, GMAC-128/256. Modes: managed, AP, etc. Bands: 2.4GHz, 5GHz. Valid interface combinations: number of Managed + APs + etc. max of 2, must be on the same channel I think. Firmware is onboard (nothing was uploaded during driver init).
NIC #2 (Cudy, Realtek): Supported ciphers as for #0 (Terow). Modes: managed, AP, etc. Bands: 2.4GHz, 5GHz. Valid interface combinations: only one interface at a time. Firmware is onboard (nothing was uploaded during driver init).

I had trouble getting several NICs to act as APs on Piki, so I tried them on Beaver. Outcomes:

0. Terow: Normally on Beaver; it works. From starting hostapd.J it takes about 60sec for the beacons to start coming out.
1. Alfa-mt: In production on Holly, can't test today, but worked on Beaver in the past.
2. Cudy: Beaver has the Realtek driver (rtw_8822bu) but Piki doesn't. Idiot, SSID is CouchNet, should be CouchNet-Beaver during debugging. Fixed all bogus SSIDs. AP enabled, but the CouchNet-Beaver beacon is not being transmitted.
3. Alfa-rt: Has driver (Piki lacks).
7. Za-pai Ralink: It has an in-kernel driver.

I added to generic.incl: beacon_int=100; start_disabled=0
Empirically, start_disabled=1 makes the Terow not send beacons; 0 lets it send them. The Cudy still doesn't send them in either case.

Learning to use hostapd_cli: Only useful-looking commands are shown.

Command line format is: hostapd_cli [-options] [command...]
The usage is displayed with option -h. Also command 'help' but hostapd has to be alive.
ping -- See if hostapd is running.
status -- Show interface status, including SSID, BSSID, number of associated stations, beacon interval, and lots of others.
get_config -- Show major configuration file variables.
disable, enable -- Dis/enable the current interface. The Terow takes a long time like over 1 min. to start sending beacons after enable.
update_beacon -- Not sure what info is refreshed.
poll_sta $addr -- Check connectivity to a station with a null QoS frame.
req_beacon $addr -- Ask a station to report about (the last?) beacon frame, e.g. signal strength.

hostapd_cli on Cudy: It is not putting out any beacons.

status -- looks identical to Terow.
get_config -- is identical to Terow.
hostapd_cli update_beacon -- Says OK. Did not make beacons appear.
dis/enable -- Says OK to both. Even after about 2min, no beacons were produced.
Conclusion: no clue was discovered as to why the Cudy doesn't send beacons.

Tidbit on a forum:
Raspberry Pi 4 hostapd hotspot not visible, OP CybeX, 2019-11-21. Options to produce debug output:
/usr/sbin/hostapd /etc/hostapd/hostapd.conf -dd | tee /tmp/hostapd.log
It looks like his issue wasn't missing beacons, it was a screwed up DHCP server. For reference, my symptom is, my AP doesn't appear in the list of selectable SSIDs in Android 12 or NetworkManager, and my AP doesn't show up in WiFi Analyzer (Android). The OP actually reports symptoms matching mine but described with less detail, and I don't see a DHCP issue in his initial report, though he says he solved his problem by disabling dhcpcd.

From another forum post: maybe rfkill didn't un-kill it. rfkill list reports that all phy's plus Bluetooth are not soft/hard blocked. For the identifier it wants the number in the list (not the phy name nor the interface name). But rfkill unblock 9 didn't start the beacons.

Running hostapd with the -dd option (debug output), see command line above. What I found:

It started. Captured 361 lines over about 30sec before I killed it.
We're on phy8.
Set mode ifindex 23 iftype 3 (AP) ; setting up AP(rad2)
Changes country code from US to US
Added expected channels (2.4GHz and 5GHz), using channel 11 (2462 MHz, this is the explicitly set channel)
Got a hex dump of beacon head and tail. Interval 100 (msec), beacon_rate=0 (watch for this on Terow)
Sets at least one key (group key?)
Interface state changed to ENABLED, Setup of interface done.
nl80211: Drv Event 15 (NL80211_CMD_START_AP) received for rad2
rad2: nl80211: Ignored unknown event (cmd=15) (???)
Frames are being sent and received but I can't tell who the peer is, could be userspace (hostapd) chattering with the NIC.
No sign of beacon frames in the log or on WiFi Analyzer.

Repeating the above with Terow (with beacons) and comparing:

Terow only: Beacons appeared after 60 to 90 sec after startup, monitored by WiFi Analyzer (Android).
I was able to associate with my phone.
Captured 699 lines of debug output (from Terow).
Initial setup is similar for both.
Terow can test for usable DFS channels, Cudy just skips them.
Both use channel 11, explicitly configured.
Creating beacon pkt: head identical except AP MAC adr/BSSID; tails differ; length 151 vs. 155, some but not all content differs.
beacon_ies[2] is 00 for Cudy, 04 for Terow; similar for the next 2 IEs.
Cudy RX frame, sa=52:12:b4:b7:8f:66 (not in /etc/ethers). Terow received one, different length, different sa, different seq_ctrl.
Then Terow sent a CMD_FRAME (Cudy didn't).
Terow got authentication request from Selen (per MAC adr).
I can't see any difference related to beacons. I think the beacon frames are sent autonomously (beacon_int=100 for both), with no per-beacon evidence sent to hostapd.

Digging through /usr/share/doc/packages/hostapd/hostapd.conf, annotated config file with default or recommended values for every parameter, and extracting everything mentioning beacons. Keywords beginning with # are defaults that normally would not be set explicitly. Many items described here as adding some element to the beacon frame, also add it to the Probe Response frame.

#local_pwr_constraint=3 -- Add power constraint in beacon.
beacon_int=100 -- Beacon interval in units of 1.024 msec.
#beacon_rate=10 -- Net speed used to send beacons, unit is (100 kbps).
ignore_broadcast_ssid=0 -- Hide SSID (send empty value) in beacons. Default value of 0 means to show the SSID.
#vendor_elements=dd0411223301 -- Add additional vendor specific elements in the beacon frame.
#max_listen_interval=100 -- How long (in beacon periods) a STA may sleep silently. Default is 65535 which means no limit.
#start_disabled=0 -- Start the AP with beaconing disabled by default. Empirically on Terow, value 0 -> send beacons, 1 -> keep silent.
#bss_load_update_period=50 -- Include BSS Load element in beacon.
#bss_load_test=12:80:20000 -- Configure fake data in the BSS Load element, for testing.
#beacon_prot=0 -- Omit Management Frame Protection on beacons. 0 is the default. To use this, set value 1 and turn on Management Frame Protection (ieee80211w != 0).
#wps_application_ext= -- Add an Application Extension attribute to beacon frames.
#roaming_consortium=021122 -- Add Roaming Consortium OI(s) to beacons.
#rrm_beacon_report=1 -- Enable beacon report via radio measurements
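The units in this list are easy to misread, so here is the arithmetic for the two values I'm actually using or watching, as a small sketch. It relies only on the units stated in the annotated config file: beacon_int counts time units of 1.024 msec, and a plain numeric beacon_rate counts units of 100 kbps.

```python
TU_MSEC = 1.024  # one 802.11 time unit, in msec


def beacon_interval_msec(beacon_int):
    """Beacon period in msec for a given beacon_int setting."""
    return beacon_int * TU_MSEC


def beacons_per_sec(beacon_int):
    """How many beacons per second a listener should expect."""
    return 1000.0 / beacon_interval_msec(beacon_int)


def beacon_rate_mbps(beacon_rate):
    """Legacy beacon_rate value (units of 100 kbps) converted to Mbit/s."""
    return beacon_rate * 0.1


# beacon_int=100 gives a period of about 102.4 msec, a bit under 10 beacons/sec.
PERIOD = beacon_interval_msec(100)
```

So with beacon_int=100 a scanner like WiFi Analyzer should see nearly 10 beacons per second from a working AP, and a NIC that shows none for a minute or more is not merely slow.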

With start_disabled=1 the NIC will start with no beacons (our symptom), so there must be a command to start the beacons. The only commands for hostapd_cli that mention beacon in the help are update_beacon (update the content of the beacon frame), and req_beacon (send a Beacon report request to a station). I wonder if you're supposed to just change the value to start_disabled=0? The command line would be:
hostapd_cli [-i $interface] set start_disabled 0

Steps in testing the above:

hostapd was not running. Unplugged Terow, plugged in Cudy.
Started hostapd.J (AP-ENABLED).
Waited 120 sec, should be ready to emit beacons by now. None were seen.
hostapd_cli get start_disabled -- it says FAIL.
hostapd_cli get beacon_int -- it says FAIL. beacon_int is among the values shown by hostapd_cli status.
hostapd_cli update_beacon -- says OK. No beacons seen.
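Since hostapd_cli get fails for these variables but hostapd_cli status does show beacon_int, one way to pull individual values out in a script is to parse the key=value lines that status prints. The sketch below does that; the sample text is illustrative, not a captured log, and the exact set of keys depends on the hostapd version.

```python
import subprocess


def parse_status(text):
    """Turn hostapd_cli status output (key=value lines) into a dict of strings."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skips lines without '=' such as "Selected interface 'rad0'"
            values[key.strip()] = value.strip()
    return values


def hostapd_status(interface):
    """Run hostapd_cli -i INTERFACE status and return the parsed dict."""
    done = subprocess.run(["hostapd_cli", "-i", interface, "status"],
                          capture_output=True, text=True, check=True)
    return parse_status(done.stdout)


# Illustrative output only; real output has many more lines.
SAMPLE_STATUS = """Selected interface 'rad0'
state=ENABLED
channel=11
beacon_int=100
ssid[0]=CouchNet-Beaver
num_sta[0]=0
"""
```

For example, parse_status(SAMPLE_STATUS)["beacon_int"] gives the string "100". This only reports what hostapd believes; as seen with the Cudy, state=ENABLED does not prove that beacons are on the air.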

Searching in the source code of hostapd-2.10.

In hostapd.c, in hostapd_setup_bss(), if conf->start_disabled is true (and other conditions), it returns -1 signifying that the BSS was not completely set up.
If conf->start_disabled is false, it kicks off stations from the previous generation, inits WPA keys, then sets operstate to 1, and finally returns 0.
Other code that sets operstate:
- ./src/drivers/driver_nl80211.c sets drv->operstate to 0 (dormant) or 1 (up). Not clear how the driver finds out that it's supposed to be up.
Conclusion: I've found how to set operstate=0, but not how to set operstate=1.

But I got a hint somewhere: iw may be my friend, specifically iw dev $IFC ap start (or stop). No, it gives a usage message. Looks like missing options.

After considerable struggle I junked the Cudy and Alfa-1200 because the drivers are out of kernel, and I found the Za-Pai Ralink NIC, which has an in-kernel driver and which emits beacons as it should; tested on Beaver.

Packetvore Attacks!

Now bringing up the Za-Pai on Piki.

Driver tried to init the hardware but got: (udev-worker): page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
Try rebooting. Success (after about 2min). Seems normal. geneve.J is active/exited.
What happens if Selen connects to Piki? Android thinks it's connected.
Selen looks at http://jfcarter.net:1447/stuff -- times out. This looks like an MTU problem, which is where I left off. [Red herring.] HTTP to a LAN host seems to be unaffected; it would know the MTU. Suspicion changes to 400 status and messed up handling of it. It works better with the https scheme.
Time for restarter check, will Piki screw up? Everything passed except network-dns-wait (restarted OK).
Digging for MTU problems. Here's what happens when I download (curl) these URLs to Piki. All were downloaded OK, no obvious packet interference.
- http://surya.cft.ca.us/hostname.txt OK (6 bytes). Path was Surya →(via OpenVPN) Jacinth → Holly →(via Wi-Fi) Piki.
- http://surya.cft.ca.us/logo-image.jpeg (100kby, same path)
- http://claude.cft.ca.us/logo-img.jpeg (54kb)
- https://www.jfcarter.net/logo-img.jpeg (54kb, same file)
- http://jacinth.cft.ca.us/logo-image.jpeg (24kb)
- https://jfcarter.net:1447/logo-image.jpeg (24kb, same file) (with http its status is 400 bad request).
- http://en.wikipedia.org/ (97kb, with -L to follow redirects)
When Selen uses the URL http://en.wikipedia.org/ the content is obtained and displayed, including images, but transport is very noticeably slower than normal, and the packet loss rate goes way up on all Wi-Fi; the time evolution of the loss rate is nearly identical on all Wi-Fi hosts (as pinged from Xena, which is on Wi-Fi to Holly). This suggests a packet storm: though none of its packets go to Xena or Holly, bandwidth in channel 11 is saturated and unrelated packets are either collided with or otherwise forced off the net.
Downloading (curl) a bigger file, 1.46Mb, http://jacinth.cft.ca.us/video-test/video/katamari-star8-10s.wmv
- Jacinth → Holly →(Wi-Fi) Xena. Speed: 3.0e6 by/s, no noticeable interference.
- Jacinth → Beaver →(Wi-Fi) Piki. Speed: 8.53e4 by/s, and Wi-Fi to Xena was 100% blocked. But the checksum of the data was correct, i.e. trashed packets were retransmitted, either by 802.11 layer 2 protocol, or TCP (can't tell which).
I was planning to do a dist-upgrade on Piki, but if it horks the net when it downloads over Wi-Fi, I'd better not try it.
Start turning off Piki services and see what makes the packet storm stop. First: Geneve off. Still packet storm. Next: hostapd.J off. Stormy again. Physically remove the NIC. Still stormy. Now all that's running is the RPi's onboard Wi-Fi NIC.
Positive control, Xena downloads a bigger file (16.7Mby) via Holly, 2.82e6 by/s, increased packet loss rate was measurable but just barely.
Could Beaver be at fault? I switch Xena over to Beaver and download both files again. The 1.46Mby file came in at 3.71e6 by/s; the 16.7Mby file's speed was 2.73e6 by/s, not distinguishable. So Beaver is not at fault.
Let's try the converse: make Pikiwf connect to Holly instead of Beaver. Confirmed that holly is getting unicast traffic from pikiwf. The 1.46Mb file's download speed was 5.27e5 by/s and Xena → Holly traffic lost about 20% of packets. 16.7Mby file's speed was 2.82e6 by/s and zero lost packets! Repeating 1.46Mb file, speed this time was 3.34e6 by/s. Both of these speeds were equivalent to Xena → Beaver. Switching Pikiwf back to Beaver; confirmed that Beaver is getting unicast packets from Pikiwf and Holly isn't.
Repeating tests on Beaver. 1.46Mb file's speed is 3.06e6 by/s! 16.7Mby file's speed was also 3.06e6 by/s, possibly 1 lost packet. My next intervention was going to be to turn off Geneve on Beaver, but I deferred that test.
Now replugging the Za-pai NIC and starting hostapd.J and geneve.J. On Piki, hostapd came up and AP-ENABLED with no hassle. Beaver, however, exuded "Dead loop on virtual device br0, fix it urgently!" (19 reps, 2 for gen1, the rest for br0). This was just after Pikiwf hostapd.J+geneve.J started. Noticeable packet interference while that was happening. Now, 1.46Mb file's speed was 3.40e6 by/s and 16.7Mby file at 3.33e6 by/s with no packets lost during either download.
Selen (cellphone) connecting to Piki → Beaver. Originating a connection to Wikipedia, it can navigate normally at perhaps slightly slower than normal speed, and no packet interference. But there's a consistent problem with https://www.jfcarter.net:1447/ (and …/~jimc/); it times out. Likely a routing issue, since this is going to go to Surya and DNAT back to Jacinth.
Leaving Piki off overnight.

What could be causing Piki's onboard NIC to trash 97% of its own packets? What I've learned so far:

I tried exchanging Beaver's and Holly's roles. CouchNet was served by the Alfa AWUS036ACM on (Beaver/Holly) and CouchNet-(Holly/Beaver) was served by the Terow on (Holly/Beaver). Geneve.J was shut off (disabled and not running) on both. Wi-Fi on Holly was perfect, i.e. well under 1% packet loss. Wi-Fi on Beaver usually loses at least 3% packets, with 4 echo requests/replies per sec and nothing more, no downloads or terminal session action, and from time to time (regularly about once/min plus random) it has a few secs of near-total loss. This has been going on since I first got Beaver.
So what differs between Beaver and Holly? net-geom.J -v may give a clue. Nobody here but us chickens; all but Jacinth Surya Xena Petra have the generic configuration: IPv4+6 address on br0 or en0, default route (4+6) via Jacinth, no other special routes, Beaver and Holly have their radios as members of br0.
ip -d link show dev br0 (on Beaver and Holly): Differences: gc_timer = 125.55/88.66 (irrelevant); otherwise identical.
ip -d -4/6 route show: identical except for default route expires time (irrelevant).
ip link show master br0: Identical except: Radio is rad0 vs rad1; MAC addresses differ; Beaver en0 has qdisc mq while Holly en0 has pfifo_fast. en0 is a member of br0.
The qdisc pfifo_fast is currently the default for some queues; other values seen are noqueue or none. I searched my config files and setup scripts, finding no mention of qdisc mq. Documentation for mq is hazy, possibly because of using a different name (mqprio maybe). MQ is described in relation to linuxnet-qos which seems to be an object oriented programming framework involving net traffic control. It says, The mq qdisc is automatically instantiated (by who? kernel or their backend?) as the default root queuing discipline for interfaces with multiple hardware queues. It is conceivable that the Raspberry Pi 4B (Beaver) Ethernet controller has multiple hardware TX queues while the RPi 3B (Holly) doesn't.
Beaver downloading files with curl, starting with qdisc mq:
- http://jacinth.cft.ca.us/video-test/video/katamari-star8-10s.wmv , 1.46Mb, Jacinth →(Ether) Beaver, under 1 sec. Repeated with a 16.7Mb file, still under 1 sec.
- We're supposed to be testing downloads over Wi-Fi. Jacinth →(Ether) Holly →(Wi-Fi) Xena: 1.46Mb, under 1 sec. 16.7Mb 2 sec, normal speed for this link.
- Switching Xena over to Beaver. 1.46Mb, under 1 sec. 16.7Mb, 3 sec. 154Mb, 22 sec. These are credible speeds for the available hardware. In all of these, there were a few ping packets not timely answered, estimated 1% to 2%, but earlier tests were far worse, a lot over half the packets lost.
Strange observation: Xena is associated with Beaver, main CouchNet is via Holly. For about 3min there's a burst of interference, about 20% packet loss to Iris (generic LAN host) and Beaver, but much less, maybe 5% loss, to Holly, which would take a hop on the LAN from Beaver. Weird. Verified that Xena is associated with Beaver, not Holly. After 3min, Wi-Fi returns to occasional lost packets, at most 0.1%. Signal strengths: Beaver -35dBm (2 meters away), Holly -40dBm, nearest competitor on the same channel: -65dBm, other neighbors -80dBm.
These interference bursts come and go seemingly at random. I don't believe they're from my sources, but there's also no evidence of malevolent sendings from outside.

This morning I switched Selen (cellphone) to Beaver and went through a lot of links. Most but not all delivered the pages (one timed out), but definitely slowly compared to Holly. Let's do tests where Selen (Wi-Fi to Beaver) plays the videos that yesterday were being downloaded to Xena by curl. What packets does Beaver see?

The 1.46Mb WMV file: On Holly it was downloaded in under 1 sec. On Beaver there were about 30 groups (payload in from Jacinth, out to Selen, Selen acks, out to Jacinth), 1448 bytes each, 43kb captured, 1837 pkts received, 200 captured (due to -c 200), 89% dropped, 400kb accounted for.
If I captured 750 pkts I might cover the whole download. Let's give it a try. With -c 2000 it covered 0.47 sec of the download, while 1.46Mb should have fit in 1009 pkts and at 54Mbit/s (typical on this link) it would have taken 0.22 sec. 2000 pkts captured, 2046 pkts received, 0 dropped, 46 mismatch the filter. VLC tends to download several Mby to its buffer at the start of performance, if the source will deliver it (vs. pseudo-isochronous streaming), which this source will. Conclusion: at last I'm seeing more packets than expected, after a lot of trouble proving that they were present.
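As a sanity check on the figures above, the expected packet count and the minimum transfer time follow directly from the file size. This is a rough sketch, assuming a 1.46e6 byte file, 1448 bytes of payload per full packet, and a link that sustains 54 Mbit/s with no allowance for headers or ACKs; the function names are made up for illustration.

```python
import math

def full_packets(file_bytes, payload_bytes=1448):
    """Number of full-size segments needed to carry the file."""
    return math.ceil(file_bytes / payload_bytes)

def min_transfer_sec(file_bytes, link_bit_per_sec):
    """Lower bound on transfer time, ignoring headers, ACKs and retries."""
    return file_bytes * 8 / link_bit_per_sec

print(full_packets(1.46e6))                       # 1009
print(round(min_transfer_sec(1.46e6, 54e6), 2))   # 0.22
```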
In all of the above thrashing around, Xena was pinging Beaver (and others) and every packet was timely answered; in past serious clog-ups many packets aren't answered within the tester's accept window.
Can I prove, from the ack window, that some packets are duplicated? It would take a lot of work. Instead I'm going to try to implement traffic control, limiting the bandwidth on the Wi-Fi NIC to slightly less than its actual capacity. I took the time to improve my traffic control script (described here).
I finished the improved script and it seems to be working, and specifically it limits the achieved data rate to close to the configured value, which is less than the maximum that the interface could do (for testing, and in some cases needed operationally). Now I'm going to re-do the download tests with traffic control active.
Conditions:
- Selen (Android) plays a short video (1.46Mb, resident on Jacinth), played with VLC, which in this case is going to fill its buffer with the whole file as fast as the net will allow. URL: http://jacinth.cft.ca.us/video-test/video/katamari-star8-10s.wmv (no https).
- The access point is Beaver, with traffic control active, and with the Geneve tunnel turned on (but nothing using it).
- Xena, on AP Holly, will run pinger to iris (irrelevant, negative control), Beaver and Selen. If all destinations fade out, this means a packet storm on the Wi-Fi physical layer. If only Beaver or Selen fades out, this means collisions or overload just on that host. As a matter of Android policy, Selen usually responds to only about 25% of ping packets.
- On Beaver: tcpdump -l -i rad0 host selen and host jacinth
- Big packets on this link have size 1448by when MTU = 1500by. So the file would fit in about 1012 packets. There should be about 1012 such packets from Jacinth to Selen, and an equal number of short ACK packets from Selen to Jacinth. (If I capture with -i any, there will be similar payloads and ACKs on en0.)
- I suspect there are packet collisions, and I'm going to need to learn how to use Kismet, to detect them.
- Result of first test run: 1448 packets received+captured, 0 dropped. Duration: 10.4 sec, equal to visible playing time. 1011 packets received of length 1448by, 435 packets from Selen (mostly length 0 but including GET request), 2 other packets. The expected number of packets were received from Jacinth, and a believable number of ACKs were sent by Selen. Video performance was normal. No ping packets were dropped except, as usual, Selen responded to only about 25% of them.
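Counts like the ones in the first test run can be tallied from saved tcpdump text output rather than by eye. The following is only a sketch: it assumes the default tcpdump text format, in which each TCP line ends with "length N", and the sample lines (timestamps, ports, window sizes) are invented for illustration.

```python
import re
from collections import Counter

LENGTH_RE = re.compile(r"length (\d+)")

def tally_lengths(lines):
    """Count packets by the last 'length N' field on each tcpdump line."""
    counts = Counter()
    for line in lines:
        found = LENGTH_RE.findall(line)
        if found:
            counts[int(found[-1])] += 1
    return counts

# Invented sample lines in the usual tcpdump text layout
sample = [
    "10:00:00.000001 IP jacinth.http > selen.40000: Flags [.], seq 1:1449, ack 1, win 229, length 1448",
    "10:00:00.000500 IP selen.40000 > jacinth.http: Flags [.], ack 1449, win 501, length 0",
    "10:00:00.001001 IP jacinth.http > selen.40000: Flags [.], seq 1449:2897, ack 1, win 229, length 1448",
]
counts = tally_lengths(sample)
print(counts[1448], counts[0])   # 2 1
```

For a real capture, pass the open file in place of the sample list, e.g. tally_lengths(open(path)) where path is the file tcpdump wrote to.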
Next test, similar to the above but with Piki connected to Beaver and Selen connected to Piki. On Piki, geneve.J and hostapd.J are active. tcpdump as above will run on Piki (and Beaver if needed).
Need to think about this: On Piki, ip -4 route show shows a route to 192.9.200.192/26 (local LAN) via rad9 (correct, working) and also via br0, but no default route. These are prefix routes. Members of br0 are en0 (normally not connected), rad7 (the access point), but missing gen0 which does not exist anywhere. I restarted geneve.J; gen0 is back. But the routes are unchanged (IPv4+6). I'm proceeding with the test without tampering with (fixing) the routes.
Connections from Selen to https://www.jfcarter.net:1447/etc time out. Connections to http://jacinth.cft.ca.us/etc are redirected to that URL and time out. When Selen connects to Beaver, these work. I showed the http page, changed to Piki, and refreshed, and it showed the same page (good). Proceeding with the test.
Can't follow links to other URLs on Jacinth like the video test file page. Could it be a DNS issue? I'm going to try opening an OpenVPN connection to Jacinth 1194. Didn't help; since the tunnel can't send bearer packets through the tunnel, and is too dumb to send bearer packets and other packets by different routes, no traffic to Jacinth goes via the tunnel. (Actually the issue is that policy routing is available on Linux but not other OS's that OpenVPN wants to be portable to.)
Next try: transplant the test file to Iris. When Selen is connected to Holly or Beaver, I can play the test file, but if connected to Piki, I can't follow the link. I also tried Iris' numeric IP (in case of DNS issues), didn't help. This is with and without the VPN.
Selen connected to Piki, web browser to Piki's IPv4 address. The index page is shown, with the logo image. There is a ssh session on pikiwf IPv6 (the uplink) (correct). Command line on Piki, tossing this ssh traffic; if you print it, you also have to print the traffic by which you print… Chicken and egg issue: you get an omelet.
tcpdump -l -i any not host pikiwf
Ditto except web browser to Beaver's IPv4 address. The index page was shown promptly, with the logo image. There were lots of echo request and reply between Selen on rad7 (the AP) and gen0 (Piki's Geneve tunnel). But no sign of HTTP traffic which should have been seen both on rad7 and Geneve payloads.
How to put both an AP and a managed interface on the same Wi-Fi NIC: Do iw list. You get a sequence of sections each titled Wiphy phy$N where $N is an integer. Guess which is which; probably phy0 will be the onboard NIC, and phy1 will be the USB NIC because it is initialized later. Look for supported interface modes. You should find Managed and AP (and others). Then look for valid interface combinations. Sometimes the descriptions are kind of cryptic, but you need it to allow at least one Managed and AP simultaneously. Usually you see channels <= 1 (both on the same channel).
Now create a normal managed Wi-Fi interface configuration and bring it up. And start hostapd with your normal setup script. They coexist happily if that's a valid interface combination.
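The check for a usable interface combination can be automated. This is a hedged sketch, not a definitive parser: it assumes the text layout that iw list commonly prints for "valid interface combinations" (groups like #{ managed } <= 1 followed by total <= N), it expects the section for a single phy, and the sample text is invented. It only accepts combinations where managed and AP are in separate groups, which is stricter than necessary.

```python
import re

GROUP_RE = re.compile(r"#\{([^}]*)\}\s*<=\s*(\d+)")

def managed_plus_ap(iw_text):
    """True if some combination entry allows one managed and one AP
    interface at the same time (separate groups, total >= 2)."""
    parts = iw_text.split("valid interface combinations:", 1)
    if len(parts) < 2:
        return False
    # Each entry starts with '*'; its continuation line carries 'total <= N'.
    for entry in parts[1].split("*")[1:]:
        groups = [({m.strip() for m in modes.split(",")}, int(n))
                  for modes, n in GROUP_RE.findall(entry)]
        total = re.search(r"total <= (\d+)", entry)
        has_managed = any("managed" in g and n >= 1 for g, n in groups)
        has_ap = any("AP" in g and "managed" not in g and n >= 1
                     for g, n in groups)
        if has_managed and has_ap and total and int(total.group(1)) >= 2:
            return True
    return False

# Invented sample resembling one phy's section of `iw list`
sample = """
    valid interface combinations:
         * #{ managed } <= 1, #{ AP, mesh point } <= 1,
           total <= 2, #channels <= 1
"""
print(managed_plus_ap(sample))   # True
```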
192.9.200.206 pikien.cft.ca.us b8:27:eb:d4:e5:13
If you bring up rad7 in managed mode, it works and can connect to an AP (Beaver). But hostapd will not start; can't init the NIC. If you start hostapd first (AP mode), it works, and beacons are emitted. Stations can associate but communication is hosed (routing?) and the station can't get a DHCP address. If you bring up rad7 later in managed mode, it works and can connect to Beaver, beacons come out, but stations can't get a DHCP address. If you take down rad7 managed, no more beacons.
Problem may be that Geneve is hosed on Piki. Geneve can't come up unless pikiwf (rad7 or rad9 managed) is up. I.e. there's a route to the server for Geneve bearer packets.
Another issue: a managed Wi-Fi NIC is not supposed to be in a bridge. In this design, it is supposed to send direct to the server (in this case Beaver). If it were in the bridge, and if the Geneve tunnel to Beaver were running, it would cause an instant packet loop. Tests above may or may not have been affected by these issues.

Subsequent testing:

geneve.J (setup program) made assumptions about en0 which I was violating, with the result that en0's IP address was not found, and the program did not check for that, and got a syntax error. Fixed I hope.
These were the conditions on Piki: en0 is connected, but neither Wi-Fi nor Geneve are up. en0 is plugged into the same hub that Iris is. Piki can download video files of various size from Iris (curl $URL | sum) at about 88 Mbit/sec with no impairment of any other connections. (HTTP not HTTPS which is substantially slower.) The theoretical max for Ethernet on RPi-3B is 100 Mbit/sec. (Beaver, a RPi-4, theoretically can do 1 Gbit/sec.)
With these conditions: en0 connected, rad9 (internal Wi-Fi) up and connected to Beaver, hostapd.J (rad7) down, Geneve down. A special route sends the Iris connection via rad9, not en0, i.e. Pikiwf → Beaver → Iris, and replies should have taken the reverse path since rad9 and en0 have different IP addresses. The average payload rate was 760 kbit/sec, whereas 22 Mbit/sec is normal for this link. Packet loss rates for all connections were near 100%, including Xena → Holly, which did not participate at all in any of this traffic. So I believe that a packet loop was saturating the aether and kicking the Xena → Holly link off Wi-Fi. This is the symptom I am trying to get rid of.
So how could the packet loop occur? Let's trace a broadcast packet such as an IPv4 ARP request (IPv6 neighbor discovery would be multicast, but all the participants are subscribed to the relevant local LAN scope group, so the effect is the same.) It could originate on any host, but let's start on Piki. The packet will go out on rad9 to Beaver (bridged to the LAN) and on en0 direct to the LAN. Piki's bridge currently has no members, thus no outgoing ARP packets. The en0 packet will be received on Beaver's bridge and will be sent out on Wi-Fi to all Beaver's stations, specifically Piki rad9. Conversely the rad9 packet will leave Beaver to the LAN, and Piki en0 will receive it. Will Piki send either of these packets any farther?
Weird: piki en0 has an IPv6 address of 2600:3c01:e000:306::cf which belongs to Piki. net-geom.J knows that it should be deleted (i.e. is probably not what put it there). Removing it did not help the packet loop. Pretty sure wickedd-dhcp6 thinks it has a lease with this adr and it periodically re-adds that address. kea on Jacinth has a record of the lease; Piki doesn't.
Found a culprit! Piki was a router, and shouldn't be. /etc/sysctl.conf net.ipv4.ip_forward = 1 and friends were turned on. With both en0 and rad9 active, they forwarded to each other through Beaver, producing the packet loop. Also /etc/sysctl.d/70-yast.conf has a cryptic copy of the relevant settings; rename to 70-yast.conf-OFF. For the fix, you need to set ip_forward etc. explicitly to 0; they won't go off by themselves. See next paragraph for the filenames.
Better just reboot Piki after this change. Steps after rebooting:
- Make sure Beaver's hostapd.J is running, used by rad9.
- ip addr del 2600:3c01:e000:306::cf/112 dev en0 (not critical but annoying)
- grep '[0-3]' /proc/sys/net/ipv4/ip_forward /proc/sys/net/ipv4/conf/all/forwarding /proc/sys/net/ipv6/conf/all/forwarding
  (make sure all three are 0)
- ifup rad9 (takes about 6 sec to come up)
- ip route add 192.9.200.203 dev rad9 (download test files over Wi-Fi, not en0)
  ip route add 2600:3c01:e000:306::cb dev rad9
- You don't want Piki's hostapd.J or geneve.J to be running, for this test.
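The forwarding check in the steps above can also be scripted, so it is harder to overlook a switch that is still on. A minimal sketch, assuming the three /proc/sys files named in the grep command; the function name and the sys_root parameter (there only so the function can be pointed at a test directory) are illustrative.

```python
from pathlib import Path

FORWARD_FILES = [
    "net/ipv4/ip_forward",
    "net/ipv4/conf/all/forwarding",
    "net/ipv6/conf/all/forwarding",
]

def forwarding_enabled(sys_root="/proc/sys"):
    """Return the forwarding switches whose value is not 0."""
    still_on = []
    for rel in FORWARD_FILES:
        value = (Path(sys_root) / rel).read_text().strip()
        if value != "0":
            still_on.append(rel)
    return still_on

# On the host being checked:
#   print(forwarding_enabled() or "forwarding is off")
```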
Retrying speed test (curl $URL | sum):
- Starting conditions: en0 up, rad9 up, to Beaver.
- Via en0: 10.5 Mby/s
- via rad9: 92 Kbit/s. Still packet loop.
- With en0 down, only rad9 up: 90 Kbit/s, packet loop.
- Switch rad9 to Holly (and en0 still down): it associated promptly (10 sec).
- Speed test: 1.7 Mby/s, no packet loop. Didn't interfere with Xena.
- Test with longer files (previous ones used a 1.4Mb file). 16.7Mb: 1.6 Mby/s. 51.4Mb: 2.0 Mby/s.
- Conclusion: Holly does not get into packet loops; Beaver does.
- (Switching back to Beaver.)
Investigating on Beaver.
- Could it be Geneve? There is no Geneve on Piki, but gen1 (server) is available on Beaver. Stopping geneve.J; gen1 is gone on Beaver. Speed test (with the 1.4Mb file): 2.8 Mby/s; 16.7Mb: 1.6 Mby/s; 51.4Mb: 2.4Mby/s.
- Conclusion: Beaver without Geneve avoids packet loops. So how does the packet loop happen, when nothing connects to gen1?
- Speculation: Piki rad9 → Beaver rad0 (in bridge) → gen1 payload → Piki rad9 (bearer) → Piki gen0, which in this case doesn't exist. Resulting in ICMP connection refused, out on rad9 → Beaver rad0 …
- If Geneve were running on Piki (which is not a router), the payload is unicast to other than Piki, and would be tossed. This backchannel is wasted resources but only doubling the traffic, hence no packet loop.
- Let's turn on Geneve on both Beaver and Piki. What happens? To start with, geneve.J on Piki doesn't create gen0. Sticking with the current thread, I'll use the -X option to force it on. 1.4Mb file: 2.3 Mby/s; 16.7Mb: 3.0 Mby/s; 51.4Mb: 2.4 Mby/s with a noticeable but not drastic increase in packet loss on Xena.
- Next job is to get Piki's geneve.J to recognize that it's a range extender (with Geneve off for Piki and Beaver). X-Windows native protocol is so inefficient, when I scroll an edit window on Piki, traffic on rad9 interferes more with Xena Wi-Fi than downloading big movie files! The problem was that Piki's hostapd was shut off, so obviously it's not a range extender nor an AP. The right response is to override manually with -X.
Now I'm going to turn on Piki's AP (with Geneve).
- hostapd.J and geneve.J are running on Beaver and Piki. Wi-Fi (rad9) is up on Piki. For the first test, so is en0.
- On Piki: curl $URL | sum . 1.4Mb: 93 Kby/s. Packet loop.
- Killing en0 and repeating the test. 1.4Mb: 94 Kby/s. Packet loop.
- tcpdump on Iris; each packet is counted twice due to -i any. There were 653 of length 1428 by and 315 of length 2856 (double). Dividing by 2, 641 pkts x 1428 byte payload len which covers about 2/3 of the payload file. Covers about 7.50 sec elapsed; transmission was interrupted after about this long with 43% of the file received on Piki. Were there duplicate packets? Too hard to match up sequence numbers. If no pkt loop, this file can be sent in under 1sec. There were stretches where Piki sent several zero length ACK packets, with nothing coming in from Iris.
Analysis of packet loop situation:
- On Piki, br0 en0 and rad9 each has its own IP address (if up).
- If Piki downloads a file it causes a packet loop, in many but not all configurations.
- The least disruptive test is to download the 1.46Mby video from Iris, with a special route (IPv4+6, 6 actually used) to send to Iris from Piki rad9. Iris replies to pikiwf (rad9, via Beaver's AP rad0) due to rad9's unique IP. Confirmed by tcpdump on Piki.
- tcpdump on Piki (-i any) shows no defective packets, no repeated packets. But only a fraction (3%?) of packets get through, from any source to Piki rad9. Estimated from the data rate with and without the packet loop.
- Survival of packets outbound from Piki has not been measured.
- It is not proven whether the (unseen) looping packets are payloads, or admin packets such as ARP. But while ARPs are seen on Piki, they are rare and are at times when an ARP seems appropriate. So I'm concentrating on payloads.
- On Piki, the packet loop happens or doesn't, independent of whether these features are turned on or off: hostapd (rad7), en0, Geneve. It isn't proven that the looping packet type is the same for every configuration.
- Only traffic from Piki hostapd's stations (if up) exits via br0 and Geneve (if up) , whereas all local LAN traffic (as seen by Beaver) is bridged to Piki's br0 via Geneve (if it's up on both Beaver and Piki).
- Beaver has en0, rad0, gen1. en0 is the local LAN uplink and so is required. rad0 is required, so Piki rad9 can act as the uplink. Geneve is required for the packet loop even if Piki has it turned off, in which case Piki would reply ICMP connection refused.
- If both machines have Geneve off, there is no packet loop.
Tracing a payload packet (mentally) from Piki to Iris and back. Conditions: Beaver and Piki both have Geneve and hostapd, though Piki's hostapd is not supposed to be used in this analysis. Beaver has Ethernet; Piki doesn't, but does have Wi-Fi. Piki starts by sending an HTTP GET request to Iris.
- Piki sends GET (destination is Iris) from rad9 → Beaver rad0.
- rad0 being in Beaver's bridge, the packet is transferred to en0 (out to Iris) and to gen1. Wrong, Piki br0/gen0 has never(?) been seen to send a packet from Iris, so Beaver would only send a new packet if it were flooding all bridge members.
- Beaver creates a Geneve bearer packet, MAC addressed to Piki, containing the GET packet, and emits it on the bridge. As Piki is one of rad0's stations, it leaves via rad0 → Piki rad9.
- Piki gets the bearer packet and unwraps it, finding a GET packet destined to Iris. We handle it starting from the first step above.
- A ferocious packet loop saturates the aether and things go downhill from there.
Learning to use Kismet. https://www.venea.net/man/kismet(1) -- UNIX style man page. https://www.kismetwireless.net/docs/readme/intro/kismet/ -- Use the … menu to get a sidebar with the toplevel table of contents. It opens on Introduction; pick subsequent topics. Kismet needs user kismet, group kismet, homedir /var/lib/kismet . To start the client, just run kismet, with no command line arguments. It emits INFO messages, and tells you to connect a browser to http://localhost:2501 . You need to specify an internal login, which is saved in the executing user's homedir, in ~$USER/.kismet/kismet_httpd.conf . This is called the administrator login and PW. In the pop-up, pick settings, and set something. I ended up not changing anything. Now how do you add a data source? Example: kismet -c rad0 . Then point a web browser at http://localhost:2501/
You also need to put the interface in monitor mode. Possibly only on particular chipsets like Atheros. Possibly mac80211 driver can let monitor and managed modes coexist, if the NIC supports it. Install package aircrack-ng (SuSE name and many other distros) and run airmon-ng (there's a man page).
Trying another approach. On Xena I put rad0 in monitor mode: airmon-ng start rad0 . It complained that NM, wpa_supplicant and avahi-daemon could change rad0 back to managed mode and I should kill them, which I did. To check: ls /sys/class/net/ ; look for rad0mon. Now you can run tcpdump and see packets (but they're encrypted, so you can't see payloads). Most are beacons.
Test procedure: Start tcpdump on Xena. Have Piki download a video file. Stop tcpdump. Dig through a zillion garbage packets. Specific setup steps:
- Beaver: systemctl start hostapd.J -- Terow was frozen; unplugged and plugged again, came to life. geneve.J auto starts.
- Piki: ifup rad9 -- associated with Beaver. Use ifstatus rad9 to check it. Can ping pikiwf.
- Piki: ip route add 192.9.200.203 dev rad9
  ip route add 2600:3c01:e000:306::cb dev rad9
- Piki: systemctl start hostapd.J -- It started. Can ping piki.
- Xena: tcpdump -l -i rad0mon -n >& $j/mon.tcp
- curl $URL | sum -- Speed 96Kby/s, high ping packet losses to pikiwf (rad9). Elapsed time 15sec, data 1.43Mby. The packetvore was definitely doing its thing.
- Xena: kill tcpdump. What did I get? 2 irrelevant packets from the neighbors. Surprising.
Next try: turn off pikien (en0) and try again. Speed 90 Kby/s, high ping losses to pikiwf (rad9), elapsed 16 sec. Packetvore is active. tcpdump captured 25 packets (probe requests) from CouchNet-Beaver, and 742 packets total, none dropped. I have a feeling that not all packets are being reported by rad0mon.
Duh, the channel wasn't set and it was receiving the default channel (probably 1), which had little activity. Command line on Xena: iwconfig rad0mon channel 11 . In 8 sec it received 802 packets of which 78 were identified as CouchNet-Beaver. Most or all were beacons.
Trying the download again with Xena rad0 in monitor mode on channel 11. Results as before, packets eaten. So what did I get this time? 6802 packets captured (whereas about 1000 could hold the whole file), 0 dropped. 275 packets identified as CouchNet-Beaver, all were beacons. curl elapsed 15sec at 94 Kby/s. Beaver rad0 MAC is 00:13:ef:5f:0c:3c . 1166 packets had this MAC, all were tagged as Acknowledgment. Piki rad9 MAC is b8:27:eb:81:b0:46 . 1166 packets had this MAC. I can distinguish Beaver vs Pikiwf sendings from the signal strength. They come in sets:
- Piki Request-To-Send (Piki MAC)
- Beaver Clear-To-Send (Piki MAC)
- Piki Data (no MAC, no length field, but it is encrypted).
- Beaver Data (ditto, different flags) (this should be the large payload packet)
- Piki Acknowledgment (Beaver MAC)
- (ringc1 to Holly, BAR, don't know what this is, extraneous)
- Repeats from the top. 5 packets per payload. So 6802 packets is sort of credible; few if any duplicate packets are seen, and particularly, the high ping packet loss rate is not seen in this analysis of the data.
In Wi-Fi Evolution on CouchNet in 2021 I did ping tests and found that the round trip time was a lot longer, like about equal to the beacon interval of 100ms, when the AP transmitted first and the station replied, compared to the station transmitting first, where the round trip time was typically 5ms (extremes: 1.2ms to 19.4ms).
If Piki to Beaver communication were somehow limited to one packet per 100ms, the data rate would be about 1.5e4 by/s, compared to the actual 9.3e4 by/s (varies ± 1e4 by/s), so I think this asymmetry in being able to initiate a connection is a red herring. However I'm checking it out further.
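The estimate above is easy to reproduce. A small sketch, assuming one full 1448 byte payload per 100ms beacon interval (the payload size is taken from the earlier captures; the function name is made up):

```python
def rate_one_packet_per_interval(payload_bytes=1448, interval_ms=100):
    """Bytes per second if only one packet gets through per interval."""
    return payload_bytes * 1000 / interval_ms

limited = rate_one_packet_per_interval()   # 14480.0 by/s, about 1.5e4
observed = 9.3e4                           # measured rate during the clog-up
print(limited, round(observed / limited, 1))   # 14480.0 6.4
```

The observed rate is roughly six times what a one-packet-per-beacon limit would allow, which is why that limit alone does not explain the slowdown.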
Setup for the following tests:
- Xena: pinger beaver holly piki pikien pikiwf
- Beaver: systemctl start hostapd.J (starts geneve.J also)
- Beaver: netpolice.new start
- Piki: power on. Get session on pikien.
- Piki: ifup rad9
- Piki: Get a session on pikiwf, exit from the one on pikien.
- Piki: ifdown en0
- Piki: systemctl start hostapd.J (starts geneve.J also) (Except it doesn't; have to stop it and start again).
- Piki: netpolice.new start
- To test: curl http://pikiwf/vidtest-146.wmv | sum
  (Or Beaver). This file is 1.46e6 bytes long. (1.43MiB)
Test results:
- Piki downloads 1.82e5 by at 9.1e4 by/s, no Xena prob.
- Beaver downloads 1.82e5 by at 5.2e5 by/s, no Xena prob.
- Piki downloads 1.46e6 by at 9.23e4 by/s, near 100% ping packet loss on Xena.
- Beaver downloads 1.46e6 by at 7.45e5 by/s, no Xena prob.
- Piki downloads 4.5e7 by: Not tested, would wipe out the net.
- Beaver downloads 4.5e7 by at 7.57e5 by/s, no Xena prob.
Downloading the 1.82e5 byte file and running tcpdump:
- Command line: tcpdump -l -i any host beaver or host piki or host pikiwf >& $j/dnld.tcp
  curl http://pikiwf/American_Beaver.jpg | sum
- Beaver downloads, tcpdump on Beaver: Beaver sends 2 packets to pikiwf:geneve, payloads are ARPs for 2 irrelevant hosts from Jacinth. (Correct to send them.) Response: port geneve unreachable. Troubleshooting:
  - systemctl status geneve.J -- Running.
  - Nothing in Piki: /var/log/firewall
  - On Piki: ss -a -p -4 -u -- Sure enough, nothing is listening on 6081 Geneve (IPv4 UDP) and no Geneve NIC is in the bridge (only rad7, the AP). Got to fix this.
  - Service cmd line: geneve.J -c -G beaver -E pikiwf
  - When run by hand it kills and creates the tunnel.
  - Mystery, but I'm going to re-run Beaver downloads.
- Beaver downloads 1.82e5 by at 6.38e5 by/s, no Xena prob.
- Beaver downloads 1.46e6 by at 6.92e5 by/s, no Xena prob.
- Beaver downloads 4.5e7by at 7.42e5 by/s, slight ping packet loss on Xena, about 10% or less. 7.42e5 by/s = 515 pkt/s.
Back to tcpdump.
- Beaver downloads 1.82e5 by; tcpdump on both. On Beaver, one irrelevant ICMP echo captured. On Piki, 27 pkts captured, 3 complete IPv6 echo request + reply from Xena to Piki via Geneve, plus some irrelevant ARPs via Geneve. Nothing from the download.
- Piki downloads 1.82e5 by; tcpdump on both. On Piki, 5 packets received, irrelevant stuff via Geneve. On Beaver, 554 packets captured. Every single one was irrelevant. Filtering cmd line: grep -v -E -i 'arp|icmp|echo|ssh|4253|domain' $j/dnld.tcp Nothing captured from the download at either end.
- Trying again with cmd: tcpdump -l -i any port 6081 or port 80 Piki downloads 1.82e5 by; tcpdump on both. Speed 9.17e5 by/s. No effect on Xena. At both ends, just 1 ICMP6 echo request captured, nothing about the download, but Beaver received 830 (other) packets and Piki got 758.
- Trying with no filter. Piki captured 26 pkts (of 1088). Beaver captured 488 pkts (of 2384). Filtering out ARP ICMP echo ssh domain+4253: each end had zero packets. Visual inspection confirms that there are no packets to or from port 80, i.e. part of the download. This includes Geneve payloads (there were lots of excluded Geneve payloads, e.g. ICMP6 echo).
- So if Piki receives 1.82e5 by of beaver photo from Beaver, and it neither leaves Beaver nor arrives on Piki via any network interface (-i any), how is it being transported, mental telepathy?
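The grep filter used above can be reproduced in a few lines when the capture needs further processing. This is a sketch of the same idea as grep -v -E -i, not a replacement for it; the sample lines are invented. Note that a keyword such as 4253 or ssh will also discard any line where it happens to appear in a hostname or sequence number, exactly as the grep version does.

```python
import re

NOISE_RE = re.compile(r"arp|icmp|echo|ssh|4253|domain", re.IGNORECASE)

def relevant(lines):
    """Keep only lines matching none of the noise keywords."""
    return [ln for ln in lines if not NOISE_RE.search(ln)]

# Invented sample lines
sample = [
    "10:00:00.1 ARP, Request who-has iris tell jacinth, length 46",
    "10:00:00.2 IP6 xena > piki: ICMP6, echo request, id 1, seq 2, length 64",
    "10:00:00.3 IP beaver.http > pikiwf.51000: Flags [.], seq 1:1449, ack 1, win 229, length 1448",
]
kept = relevant(sample)
print(len(kept))   # 1
```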
Testing next day: with Piki en0 down, and Piki geneve.J restarted by hand so gen0 actually exists, Piki downloads 1.82e5 by from Beaver at 2.90e6 by/s which is a very plausible speed. No effect on Xena. Beaver downloads 1.82e5 by from Piki, ditto except only 1.34e6 by/s.
Piki downloads 1.46e6 by from Beaver at 3.22e6 by/s, no Xena prob.
Piki downloads 4.53e7 by from Beaver at 3.30e6 by/s, no Xena prob.
I think this sucker is fixed! The problem is (suspected to be) that Geneve spuriously failed to start immediately after hostapd.J started. Got to fix that.
Selen connects to CouchNet-Piki.
These sites could be connected to: Wikipedia, Google front page, NOAA Weather, SCEDC Recent Earthquakes, KUSC front page.
These sites timed out, DNS failure suspected: Jacinth front page, Jimc's home page on Jacinth, Jimc's site on Claude, Home Assistant on Dragon, MyUCLA patient portal, SuSE package search, Packman.
Got past DNS but didn't finish loading: Amazon.com, Fidelity.

Progress and Testing in Semi Production

Now I'm setting up Beaver and Piki for semi production.
- Beaver: hostapd.J (with geneve.J) is already enabled, but gets shut down manually due to my paranoia.
- Piki: en0 used bootproto=dhcp which is probably why it was getting a spurious instance of Piki's IPv6 address. Changed to static, and startmode=manual. rad0 had startmode=manual, changed to auto.
- Piki: hostapd.J (with geneve.J) was newly enabled.
- I rebooted both machines. Beaver came up OK. Piki didn't. I'll have to fix it tomorrow.
Deployment checkout:
- A lot of the routing problems were because there was no default route. ifcfg-en0, rad9, br0 were changed from BOOTPROTO=static to dhcp, and it now gets the default route (Jacinth) that way.
- The above interfaces had STARTMODE=manual, all changed to auto. When en0 isn't supposed to be up, I'll just unplug it rather than telling Wicked to not start it.
- Piki can't ping xena and petra (4+6). The packets ended up on the wild side. This is Jacinth's fault, possibly due to experimenting connecting Xena to CouchNet-Beaver. Not Piki's prob.
- net-geom.J wants to add a default route on br0 via Jacinth, and to toss a spurious aleatory address on br0. It's not messing with en0 or rad9. I'm leaving net-geom.J disabled until Piki can ping Xena and Petra.
- Testing video download: Piki downloads from Beaver. 1.46e6 by at 1.16e6 by/s (slower than yesterday, why?) 4.5e7 by at 1.09e6 by/s. In that one, with Xena associated with Beaver, Xena lost ping packets but under 10% were lost. This is not surprising. Traffic shaping (netpolice.new) was not active for either test.
- Selen associated with Piki. Prompt connection unlike before. Connecting to LAN and wild side webservers.
  Connected OK: Wikipedia, Google front page, SCEDC earthquake site, KUSC front page, Amazon.com, Google (Android) Play Store downloads.
  Botched DNS: MyUCLA Patient Portal, Jacinth, Claude, Home Assistant on Dragon.
  DNS passed but timed out: NOAA Weather, Fidelity.
  This is just about the same set of winners and losers as yesterday.
- Traceroute from Selen → Piki → Beaver → Jacinth replies in one hop, due to bridging and Geneve. Ditto from Jacinth to Selen, and Jacinth can ping4 Selen (no DHCP6 on Android so not tested).
I'm transferring Xena's Wi-Fi to Piki and trying to diagnose the routing issues seen on Selen.
- Xena associated right away and the VPN started. Ping replies resumed after a second or two pause. But the TCP/SSH connections to various hosts were terminated.
- curl http://beaver/vidtest-146.wmv -- nothing downloaded (but a HEAD request succeeded). Same result from Piki.
- Not much useful info from this trial.
Selen associated with Piki, tcpdump on Piki looking for anything coming from Selen. What is it getting stuck on, particularly for local LAN clients?
- No packets captured. That was Firefox loading Jacinth (success except no Iris logo image). Repeating using H.E. tools: 2 A queries for a 3gppnetwork.org host (answered), but no sign of the test query I made.
- Looking at all traffic involving Selen, with H.E. DNS query. I got one ARP reply from Selen (to where? not shown) but nothing else.
- Firefox displays jimc's page on Claude (successfully). I got a lot of chatter with (various).googleusercontent.com and (stuff).compute.amazonaws.com, then chatter to Jacinth's wild side (port 443), then an interaction with Dragon (I didn't ask for it).
- Next Firefox asks (unsuccessfully) for http://claude.cft.ca.us/ . Selen sends HTTP to Claude, and Claude answers, syn-ack combination. Selen sends GET /, and Claude sends what might be the web page. They go back and forth acking the same byte range, and Selen eventually sends FIN, Claude replies FIN, and Selen sends RST.
- Do you suppose it could be an MTU issue? I don't see any returning ICMP "packet too big" messages.
  - Is df (do not fragment) set? Default is unset, but should be set.
  - Brainwave: expand the Geneve tunnel so the payload can be 1500 bytes long.
  - Per RFC 1191 section 4, the MTU is the size in octets of the largest datagram that can be handled; it includes the IP header and IP data, but not lower layer (containing) headers such as Ethernet and tunnel (Geneve) encapsulation.
  - Per RFC 8926 (proposed standard as of 2020), in the common case being implemented on Beaver, an IPv4 Geneve packet contains these segments:
    - an Ethernet header of 18 octets,
    - an IPv4 header of 20 octets,
    - a UDP header of 8 octets,
    - a Geneve header of 8 octets plus variable length options (currently zero length),
    - followed by the payload packet, which starts with a 18 octet Ethernet header.
    - Then comes the payload for which the MTU has to be calculated,
    - followed by a 4 octet checksum.
    - So the MTU of the bearer channel on IPv4 has to be 58 octets bigger than the payload MTU.
  - An IPv6 Geneve packet starts with:
    - a 18 octet Ethernet header,
    - an IPv6 header of 40 octets,
    - a UDP header of 8 octets,
    - the 8 octet Geneve header plus 0 variable length options,
    - an 18 octet inner Ethernet header,
    - the payload,
    - and the 4 octet checksum.
    - The IPv6 bearer channel's MTU has to be 78 octets bigger than the payload MTU.
  - When the Geneve tunnel is created the peer's IP address must be specified; thus it's known in advance whether the bearer and payload MTUs need to differ by 58 or 78 octets.
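The two overhead totals can be checked by adding up the segments listed above. This sketch uses the octet counts exactly as given in the lists (including the 18 octet inner Ethernet header and 4 octet checksum); the outer Ethernet header is left out because, per the RFC 1191 definition quoted above, the MTU does not include lower layer headers. The names are illustrative.

```python
# Octet counts as listed above
IP_HEADER = {4: 20, 6: 40}   # outer (bearer) IP header by version
UDP_HEADER = 8
GENEVE_HEADER = 8            # fixed part; options are counted separately
INNER_ETHERNET = 18
CHECKSUM = 4

def geneve_overhead(bearer_ip_version, option_bytes=0):
    """Octets by which the bearer MTU must exceed the payload MTU."""
    return (IP_HEADER[bearer_ip_version] + UDP_HEADER + GENEVE_HEADER
            + option_bytes + INNER_ETHERNET + CHECKSUM)

print(geneve_overhead(4), geneve_overhead(6))   # 58 78
```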
Test plan: First turn on DF on the Geneve tunnel, br0, rad9, rad7, en0; this will prevent the payloads from being fragmented, and per RFC 1191 the flow containing them should adapt to Geneve's bearer MTU. If that doesn't help, set explicit MTUs on various NICs.
Fakeout: MTUs were calculated assuming Ethernet header length. Per Gast, Matthew S, 802.11 Wireless Networks: The Definitive Guide (2nd ed.), O'Reilly 2005-04-xx, ISBN 9780596100520 , an 802.11 data frame has these fields:
- Frame control (2 octets).
- Duration (2 octets, how long before another station may try sending a packet).
- Three 6-octet MAC addresses (18 octets).
- Seq-ctl (2 octets); Seems to be a sequence control counter.
- We're up to 24 octets.
- Some frames (WDS, Wireless Distribution System) have another MAC (6 octets).
- 802.11ac has 6 additional octets of feature control, and it can hold multiple payload packets, up to a total of 11426 octets.
- Some frames have a 802.11i cryptographic header. (Common acronym: WPA2.) I wasn't able to find a definitive statement of its length. Around 32 octets is likely.
- Now comes the actual payload for which the MTU is needed, up to 2312 octets for 802.11b and g.
- Ending with a 4 octet checksum.
- Not counting the two optional fields, the header + checksum occupy 28 octets and the maximum packet size appears to be 2340 octets (28 + 2312).
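To double-check the arithmetic, here is a small Python tally of the field sizes listed above. The names are invented for the example, and the 802.11i cryptographic header is left out because its length is not pinned down.

```python
# Field sizes in octets, as listed above from Gast's book.
FIXED_FIELDS = {
    "frame_control": 2,
    "duration": 2,
    "addresses": 3 * 6,   # three 6-octet MAC addresses
    "seq_ctl": 2,
    "checksum": 4,
}
WDS_EXTRA_ADDRESS = 6     # optional fourth MAC address on WDS frames
MAX_PAYLOAD_BG = 2312     # payload limit quoted above for 802.11b and g


def frame_overhead(wds=False):
    """Header plus checksum octets, optionally with the WDS address."""
    total = sum(FIXED_FIELDS.values())
    if wds:
        total += WDS_EXTRA_ADDRESS
    return total


print("overhead:", frame_overhead())
print("largest frame:", frame_overhead() + MAX_PAYLOAD_BG)
print("largest WDS frame:", frame_overhead(wds=True) + MAX_PAYLOAD_BG)
```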
When Selen is associated with Beaver, but no traffic should be going through Geneve, Selen has extreme but less than 100% packet loss when trying to play music. With Geneve turned off on Beaver, it works fine.
My next test will use a radically reduced MTU of 1024 bytes on the Geneve tunnels of both Beaver and Piki. Now neither Beaver nor Piki can get IPv6 addresses… because the lower bound for MTU on IPv6 is 1280 bytes. Fixing this with -M 1380, so the Geneve tunnel can legally transmit IPv6 (with its MTU of 1322 by). Now br0 on Beaver and Piki have DHCP6 addresses.
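A quick Python check of why 1024 failed and 1380 works. It assumes that the 1024 figure, like the -M 1380 value, is the bearer MTU and that the tunnel NIC comes out 58 octets smaller; the function names are hypothetical.

```python
IPV6_MIN_MTU = 1280          # minimum link MTU that IPv6 requires (RFC 8200)
GENEVE_IPV4_OVERHEAD = 58    # bearer-minus-tunnel difference used in these notes


def tunnel_mtu(bearer):
    """MTU of the tunnel NIC for a given bearer MTU (IPv4 bearer assumed)."""
    return bearer - GENEVE_IPV4_OVERHEAD


def can_carry_ipv6(mtu):
    """True if an interface with this MTU is allowed to run IPv6."""
    return mtu >= IPV6_MIN_MTU


for bearer in (1024, 1380):
    mtu = tunnel_mtu(bearer)
    print(f"bearer {bearer}: tunnel MTU {mtu}, IPv6 OK: {can_carry_ipv6(mtu)}")
```

Even if 1024 was meant as the tunnel MTU itself, it is still below 1280, so the outcome is the same.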
With all that straightened out, Piki can download the 1.46e6 by file at a speed of 1.04e6 by/s which is pathetic but is not subject to the clog-ups seen before. Playing music on Piki from Jacinth: Failed, VLC could not connect.
More testing: Selen is associated with Beaver and tries to play music resident on Jacinth, URL = https://www.jfcarter.net:1447/~jimc/music/music.cgi/Hindemith_Music/2_Hindemith_Sym_Matamorphoses.m3u Wonder of wonders, it plays successfully. Changing to Piki: Android believes it's not connected to the Internet. VLC could not retrieve the above URL. Editing the URL of the music index page to http://jacinth.cft.ca.us/… -- could not load it. Could not load Jacinth's front page either.
I made Xena associate with Piki. The VPN connected to Jacinth via Piki. All downloads succeeded (same as when going via Holly). Turning off the VPN. That provoked problems. 1 download succeeded, to en.wikipedia.org, 9.75e4 bytes at 2.14e4 by/s (very slow). The rest all timed out. Soon thereafter, Wikipedia gave 0 length responses with a code of 200 (robot defense?)
Conclusion: there is something crazy with Piki stations, that does not affect downloads done by Piki itself.
Interesting observation: gen0 MTU=1322 (bearer packet MTU=1380), br0 MTU=1322 (minimum of all members?), rad7 MTU=1500. What would happen if I set rad7 MTU=1322?
- ip link change dev rad7 mtu 1322 worked.
- Selen associates with Piki. No maundering about can't connect to Internet. (But 10min later it brought back this complaint.)
- But it still can't open https://www.jfcarter.net:1447/ ;
- It does open http://jacinth.cft.ca.us/ including images (possibly cached);
- Could not open the test video on that site.
- Could not open Claude's front page.
- Couldn't open Google front page.
- Did open Wikipedia front page but timed out retrieving some but not all images on that page.
Another test: Selen associated with Beaver. Internet was not complained about. I was going to try to play music, but it loaded part of the music index page, froze, and timed out.
Today's test: going through Beaver step by step. Piki is associated with Beaver. Xena is associated with Holly (not being tested).
- Turning off Geneve on Beaver. This knocks out Xena → Piki pings, but pikiwf is still OK. On Beaver, net-assess all 6 downloads were OK.
- Selen associates with Beaver. Playing music from https://www.jfcarter.net:1447/… Played about 15min, no problems. br0 and rad0 both have MTU=1500.
- Turning Beaver's Geneve back on, and playing similar music. gen1 is in br0. br0 and gen1 have MTU=1442, rad0 MTU=1500. Played 20min, no problems. This is not the experience I had last night on Beaver.
Let's look carefully at the MTU issue. When geneve.J creates a tunnel (on Piki), it sets MTU=1380 for the bearer packets, and the tunnel NIC comes out with MTU=1322 which is 58by less (IPv4 size). On Beaver the bearer has the default of MTU=1500, and the tunnel NIC is 1442, 58by less (same as on Piki). What MTU is really needed? I'm assuming that DF is on and that senders will adapt their packet size to the available MTU. The MTU of pikiwf and of Beaver:rad0 needs to be -ge the bearer size. (T) The MTU of both tunnel NICs is set automatically to 58 bytes less, and this automatically sets both br0 to the same (or smaller) size. The MTU of Piki rad7 needs to be equal to the tunnel MTU.
Rather than reducing the MTU, let's try increasing the tunnel MTU to accommodate 1500by payload packets. No, that's not going to fly, because Beaver rad0 has to accommodate the packets too. Trying to get this right:
- Beaver br0, rad0, gen1 MTU=1500 (default). Oops, rad0 MTU has to hold the bearer packets, while gen1 will be 58by less, making br0 the smaller size, so the bearer packets can't get onto br0 and reach rad0.
- Let's bloat up Beaver: br0 contains en0, rad0 and gen1, for normal stations, all with the default MTU=1500.
- Beaver's rad9 (new) is for bearer packets only, in and out. Its hardware is the internal NIC, or possibly a second NIC on USB. Its MTU is 1558, which will have to be set explicitly, and it's not in a bridge, but it is operated as an AP by hostapd. Policy routing sends bearer packets to rad9 and Piki is expected to connect to it.
- Piki's rad9 is the internal NIC, with a MTU of 1558. It is a normal station in Managed mode, connecting to Beaver rad9. Policy routing sends bearer packets to it, but nothing prevents it from accepting a SSH connection too.
- Piki's br0 contains rad7 and gen0, both with the default MTU of 1500. It's an AP operated by hostapd. Normal stations connect to it.
- Piki's en0 is a standalone uplink, normally not connected, and not in br0. MTU is the default 1500by. It has its own IP and name (pikien).
I executed the above plan.
- I created and added beaverwf = rad9 to hostdata.db; installed all derived files; rebuilt kea.J DHCP assignments adding beaverwf.
- I created beaver: /etc/sysconfig/network/ifcfg-rad9 with static fallback IPs, STARTMODE=auto, BOOTPROTO=dhcp. No wireless configuration because it's an AP. Oops, no DHCP server is (or will be) on its not yet existing net, so it gets no IP adr from DHCP, and $ft/dhcp.S fails. Changed to BOOTPROTO=static. Now it has its IPv4+6 addresses. The LAN (Jacinth and Xena) can ping4 but not ping6 beaverwf.
- I created /etc/hostapd/09-onboard-d8:3a:dd:26:f5:2c
- I created beaver: /etc/systemd/network/50-rad9.link matching on the MAC address. OP Mr. Diba (2022-05-18) on StackExchange and a respondent found how to activate this without rebooting: systemctl restart systemd-udev-trigger.service
- But since rad9 has no Wi-Fi configuration info (SSID and password), ifup rad9 does set it UP but with NO_CARRIER, and no IP address can be assigned. Changing STARTMODE=manual.

The Side Sewer

After struggle, here's the side sewer I'm in:
- Beaver's rad0 is the Terow / Mediatek AC-1200Mbps, 00:13:ef:5f:0c:3c
- In hostapd.conf or equivalent, if you specify channel=11 it is ignored and it does ACS anyway, failing with these error messages:
  - brcmfmac: brcmf_set_channel: set chanspec 0xd022 fail, reason -52. It then says rad9: ACS-COMPLETED freq=2437 channel=6, ACS-ENABLED, AP-ENABLED. (Is it sending beacons? I don't see them.)
  - Then it says, not specifying the interface until the end:
    ACS: Survey is missing noise floor (5 times)
    ACS: Channel 1 has insufficient survey data (6 lines repeated 11 times for channel 1-11.)
  - Now it says:
    ACS: Surveys have insufficient data
    ACS: All study options have failed
    Interface initialization failed
    rad0: interface state ACS->DISABLED
    rad0: AP-DISABLED
  - After which hostapd exits, disabling rad9.
- Do you suppose it knows that I'm putting 2 APs on one channel, silently ignores both channel specifications, falls back to ACS, and things go downhill from there? I set channels 6 and 11 explicitly. This actually worked! CouchNet-Beaver and CouchNet-Geneve are sending beacons.
- Next problem is geneve.J -- it's allegedly active but no settings are present. Restarting it. gen1 is in state UNKNOWN (UP,LOWER_UP are both present though). MTU is 1450 (from previous attempts). rad9 has MTU 1500, should be 1558. Got to fix these: now testing LOWER_UP (vs. state UP).
- I can do ip link set dev gen1 mtu 1500, but if I do the same for rad9 (mtu to 1558) it doesn't change. Trying to explicitly set MTU=1558 in /etc/sysconfig/network/ifcfg-rad9 .
After an update, I tried deploying the current hostapd.J on Holly. Not wise.
- Selen thought it was associated with CouchNet (on Holly), and it communicated normally to sites on and off the LAN, and could be pinged from Jacinth. (Via cellular data, Selen could only get through Jacinth's firewall by a VPN or by opening a hole (authentication required for both), neither of which was done.) Selen had its correct LAN IP address. Yuna (iPhone) seemed to behave normally also, but detailed checking was not possible since it's not my phone.
- Other hosts associated with Holly and began an accounting session, but never obtained an IP address.
- In many attempts, hostapd's conf file included channel=11 but even so it started ACS (Automatic Channel Selection). Invariably hostapd logged, for each legal channel, ACS: Survey is missing noise floor, 5 times, followed by ACS: Channel 1 has insufficient survey data. It is likely but not assured that the NIC firmware is not capable of doing ACS (and in the past I have never succeeded on any of my NICs).
- As part of the fix, channel=11 was moved early in the conf file, after ssid=CouchNet and before hw_mode=g and ieee80211n=1, and channel 11 was used with no ACS attempt. Before the fix, it was a lot later, after ieee80211h=1 and a lot of other parameters. Before an unknown update in an unknown package this did not result in an ACS attempt. I'm using hostapd-2.10-2.10.aarch64 (failing). The order of parameters in the conf file is the same as found in /usr/share/doc/packages/hostapd/hostapd.conf (the example conf file with all known parameters annotated), except for a very few most of which are mentioned above.
- However, early vs. late setting channel didn't bring on bogus ACS. I tried systemctl start hostapd.J ; the conf file was identical, but this time bogus ACS was attempted. I don't see any difference between the command line I used and the one in the systemd unit file; I made the command line be logged (-v). Oooo, there's a big difference! It's not appending the generic conf file that has channel=11, which would fully explain why hostapd was attempting ACS.
- I fixed the bug in /usr/diklo/sbin/hostapd.J and now the specified channel is used, on both Piki and Beaver.
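A check like the following would have caught the missing channel line before hostapd started. It is a minimal Python sketch, not what hostapd.J does. It assumes, per the comments in the example hostapd.conf, that channel=0 or channel=acs_survey requests ACS and that an unset channel defaults to 0, and it assumes a later setting overrides an earlier one. The function names and the sample conf text are made up.

```python
def explicit_channel(conf_text):
    """Return the last channel= value in a hostapd conf, or None if absent."""
    channel = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "channel":
            channel = value.strip()       # assumed: later lines override
    return channel


def will_attempt_acs(conf_text):
    """True if this conf leaves channel selection to ACS."""
    channel = explicit_channel(conf_text)
    return channel is None or channel in ("0", "acs_survey")


# Hypothetical conf fragments: one missing the generic part, one complete.
without_generic = "ssid=CouchNet\nhw_mode=g\nieee80211n=1\n"
with_generic = without_generic + "# generic settings\nchannel=11\n"

print(will_attempt_acs(without_generic))
print(will_attempt_acs(with_generic))
```

Running this on the concatenated conf text that the service script actually hands to hostapd, rather than on the individual pieces, is what would expose a piece that never got appended.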
Now continuing with testing the plan for MTU. Not successful.
- Piki: rad9 is supposed to carry Geneve bearer packets. Since the payload MTU is 1500, I want rad9's link MTU to be 1558. But when I do ip link change dev rad9 mtu 1558 it says Error: mtu greater than device maximum. (And similarly with ip link set rad9 mtu 1558.) But setting the MTU to a lower value is allowed. rad9 is not in br0.
- Per RFC 894 dated 1984-04-xx, the minimum length of the data field of an Ethernet packet (and therefore the maximum length of an IP packet, so it says) is 1500 bytes. The kernel assumes that just about every IP packet will eventually end up on Ethernet, and I'm assuming this is the reason that it enforces an upper bound of 1500 bytes for the MTU, even on 802.11 for which the physical protocol's MTU is really 2346 bytes, and the Geneve bearer packets never ooze off to Ethernet.
- So I'm going to have to choke the payload MTU to 1442 or 1422 bytes (IPv4 or 6) so the bearer packets will come out to 1500 bytes.
- An alternative is to turn off DF (Do Not Fragment) on the bearer interfaces (Piki rad9 and Beaver rad9) and let the bearer packets always be fragmented. This is the default.
Testing non-DF on the bearer channel, in fact, non-DF everywhere, which is the default.
- As naively set up, it appears to operate completely normally. net-assess downloads all targets perfectly. Of course en0 is plugged in and all the connections go there.
- Trying it with en0 unplugged. Piki has no default route but has prefix routes for the LAN via br0 and rad9. ip route get 192.9.200.203 puts it on br0 (Geneve) but it could equally go on rad9. Xena can ping piki (br0) and pikiwf (rad9) on IPv6 but not IPv4.
- Outcome of net-assess executing on Piki:
  - All 6 test files/pages were downloaded with a HTTP code of 200 (good) and checksums, when checked, were correct (some pages vary each time and the checksum is skipped).
  - http://jacinth/katamari-star: speed 4.62 Mby/s, good, fully operational.
  - https://www.jfcarter.net:1447/katamari-star: speed 1.57 Mby/s, adequate, speed hit is blamed on HTTPS crypto overhead.
  - 4 onsite and offsite short pages were downloaded successfully. Speed ranged from 137 Kby/s to 329 Kby/s and the download time ranged from 0.012s to 0.265s. Plausible causes of the slower speeds: connection and/or HTTPS setup for the shorter payloads; capacity limits at the server.
  - Conclusion: This configuration is working as it should!
- Selen connects to Piki and plays katamari-star.
  - Wi-Fi connection was OK with no hassle getting the correct IP address.
  - Playing video (katamari-star) from Jacinth using VLC: completely normal.
  - Steam Train with VLC: This is a big file with some fancy encoding for 3D glasses, and it hogged the aether; Xena pinging Piki got 100% packet loss, and fairly radical loss pinging Beaver and Holly, meaning Xena's packets were being lost. Video was in segments (buffering).
  - Wave Surfers with VLC: MPEG-2 51Mby in 16sec, which also overloads the aether but not as bad as the Steam Train.
  - Morgen (moon-set and sunrise) with VLC, similar to the previous two.
  - Big Buck Bunny with VLC: H.264, 725Mby in 10min. Net speed was not quite enough to keep up with playback speed. Ping packet loss (Xena to Piki) was about 50% at the worst times (better than less efficient other formats) but there was no noticeable interference with other ping partners.
  - This is the first time I've gotten the video test files to actually work.

Deploying Piki

To finish before deploying Piki:
- For debugging, hostapd.J on Beaver and Piki are disabled. These need to be enabled. [Done]
- Piki needs ip route add default via 192.9.200.193 dev br0. With a default route for IPv4, Xena can now ping piki and pikiwf and the answers will come back to Xena. Piki gets an IPv6 default route (on both rad9 and br0) by Router Advertisements on the LAN. Piki's IPv4 address is static; if it were DHCP the default route would come in that way. [Done]
- On Beaver and Piki it ought to be possible to put an AP and a managed interface on the same NIC, thus avoiding collisions between packets sent from different NICs. But both services have to be on the same channel.
- Piki's SSID is currently CouchNet-Piki. For testing with the Ring camera call it CouchNet-EXT. Then change back to CouchNet-Piki and reconfigure the camera to use that SSID.
- Disconnect Piki's en0 for testing.
- Need to test if Piki and Beaver will come up cleanly when rebooted.
  - Beaver: rad9 came up, but not rad0, due to hardware weirdness. When rebooting Beaver try doing poweroff, then physically power cycle. Didn't help. Also reloading hostapd didn't help. You have to stop hostapd.J, remove the NIC, plug in again, and start hostapd.J again. Then it comes up clean. Blecch.
  - Despite pikiwf going up and down while straightening out Beaver, Piki ended up correct without any more reboots or restarts. Including the formerly missing default route.
  - beaverwf doesn't answer ping6 from Xena (only ping4). It has a default route via br0 to Jacinth per Router Advertisements: correct, but why doesn't Xena's ping6 go through? There are LAN prefix routes (IPv6) of equal (medium) preference via br0 and rad9. Beaver's route via rad9 doesn't actually go to the LAN, it goes only to the clients on Piki. Actually Piki is [supposed to be] bridged by Geneve back to Beaver, so traffic on that route ought to make it through. Investigate later.
- I switched Piki's SSID to CouchNet-EXT and Ring camera 2 connected. It's chattering with the mother ship and successfully sends out video: seeming fully operational, no communication problem. This is with Piki next to Iris, not in the laundry room. Here's the signal strength reported by each camera:
  - ringc1 to Holly, RSSI -57dBm, battery 89%
  - ringc2 to Piki, RSSI -79dBm (pretty bad), battery 46% (Piki near Iris)
  - ringc3 to Holly, RSSI -50dBm, battery 53%
- Moving Piki to the laundry room: From Beaver's location, Piki:rad7 (the access point) has a signal of -79dBm, which is pathetic. In the opposite direction, iwconfig rad9 also reports a signal level of -76dBm, which is also pathetic. Xena (through Beaver) can't successfully ping Piki, and I couldn't shut down Piki by systemctl; I had to just turn off power.
- Moving Piki to the piano: From Beaver, Piki:rad7 signal is -70dBm. Piki:rad9 reports Beaver's signal as -71dBm. The Ring camera is chattering with the mother ship, but with unreliable packet delivery and slow responses to user queries. This isn't going to fly either.
- Moving Piki near the rice cooker: From Beaver, Piki:rad7 signal is -46dBm. Piki:rad9 reports Beaver's signal as -62dBm, which is not wonderful but can be used; bit rate is estimated as 14.4Mbit/s. The camera reports RSSI -65dBm, barely acceptable. But the rice cooker belongs in that space.
- Moving Piki near the dishwasher: from Beaver, Piki:rad7 signal is -62dBm. Piki:rad9 reports Beaver's signal as -63dBm and data rate is 5.5Mbit/s. These are acceptable, barely. Camera reports RSSI -70dBm, barely acceptable. I think this is going to be Piki's new home.

Signal strength survey with Piki in various locations. Column headings:

rad7: signal strength (dBm) of Piki's access point, the Za-pai Ralink (USB 148f:5370), chipset 5390/5370, rt2800usb driver, measuring beacons at Beaver's location.
rad9: signal strength (dBm) of Piki's onboard NIC connecting to Beaver's rad9 AP.
Ring: signal strength reported by Ring camera 2 connected to Piki's AP (rad7).

Location	rad7 dBm	rad9 dBm	Ring dBm	Notes
Laundry	-79	-76	--	No communication
Piano	-70	-71	--	Camera barely works
Rice Cooker	-46	-62	-65	Works, speed 14.4Mbit/s
Dishwasher	-62	-63	-70	Intermittent signal loss
Curio Cabinet	-50	-58	-62	Works, speed 6.4Mbit/s
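One way to compare the rows is by the weakest of the three links at each location, since the extender is only as good as its worst hop. The Python sketch below does that with the table's numbers; the weakest-link criterion is an assumption made for illustration, not a method used in the survey, and missing Ring readings are simply ignored.

```python
# (location, rad7 dBm, rad9 dBm, Ring dBm or None when not reported)
SURVEY = [
    ("Laundry", -79, -76, None),
    ("Piano", -70, -71, None),
    ("Rice Cooker", -46, -62, -65),
    ("Dishwasher", -62, -63, -70),
    ("Curio Cabinet", -50, -58, -62),
]


def weakest_link(row):
    """Lowest (worst) signal among the readings available for a location."""
    readings = [value for value in row[1:] if value is not None]
    return min(readings)


# Best location first: the one whose worst link is the least bad.
ranked = sorted(SURVEY, key=weakest_link, reverse=True)
for row in ranked:
    print(f"{row[0]:14s} weakest link {weakest_link(row)} dBm")
```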

First the good news: Piki (near the dishwasher) seems to be fully operational. Sometimes. Ring camera 2 connects to it and does normal stuff with the mother ship. Xena can do SSH to piki and pikiwf. But testing with ping (IPv4+6) from Xena, much of the time its packet loss rate is low (1% loss or less), but it gets into episodes of high loss (over 50%) and also gets into a catatonic state from which it does not recover; it does send beacons at -57dBm to -66dBm, but no ping answers or any other communication. In one case that was timely observed, onset was sudden from prior low packet loss state, and both pikiwf (rad9) and piki (rad7) dropped out at the same time. This lets out bugs in rad7's firmware as the culprit, but availability of Piki depends on pikiwf, so issues on rad9 could cause the observed symptoms. In another incident it recovered from 100% packet loss to 90%, but after a few seconds it went back to 100%.
I put a signal monitor on Piki:rad9 and let it run overnight. It's reporting the signal seen by Piki when connecting to Beaver, every 1.4 secs, plus whether Jacinth (via Beaver) responded to ping4. Here's a summary of a histogram of signal strength.
- Total samples: 34617.
- Starting at 22:11, the first ping failure was at 09:39. Failures were scattered, sometimes two together, under 20% of samples, until 10:15.
- At 10:15 the signal dropped to -78dBm and ping always failed. Signal declined to -81dBm. At 10:36 it suddenly jumped to -72dBm and pings started succeeding. The signal went up and down, and at -76dBm and worse, pings would fail. Actually around 12:18 the threshold was more like -74dBm. At 13:40 the signal jumped from -73dBm (success) to -78dBm (fail) and continued failing to 14:06 (end of test) despite a normal range of signal strengths between -67dBm and -80dBm. About 1% of successes had a signal strength of -76dBm or worse, even one at -85dBm. About half of the failures were similar while the other half had higher signals which usually allowed a success.
- Average signal: -71.6dBm.
- The range of signal strengths was -65dBm to -85dBm (1 each).
- The plurality value was -71dBm.
- The range enclosing about 90% of values was -70dBm to -74dBm.
- Normal signal strength (beacons and signal monitor output) is seen during periods of non-response.
Conclusion: for Piki:rad9 to Beaver:rad9 (both onboard NICs), for the dishwasher location the signal strength varies from -65dBm to -85dBm with -71.6dBm being the average and -75dBm being the minimum for consistent successful transmission. (But one success at -85dBm was seen, plus others on low signal.) So I should try another location. I wish I knew why the signal strength varies so much.
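For reference, the kind of summary quoted above (average, plurality value, range, central 90%) can be computed from the monitor's samples roughly as follows. This is a Python sketch with a made-up record layout and a tiny invented sample list, not the overnight data or the script that produced the figures.

```python
from collections import Counter


def summarize(samples):
    """samples: list of (signal_dbm, ping_ok) tuples from the signal monitor."""
    signals = sorted(signal for signal, _ in samples)
    n = len(signals)
    successes = [signal for signal, ok in samples if ok]
    return {
        "samples": n,
        "average": sum(signals) / n,
        "plurality": Counter(signals).most_common(1)[0][0],
        "range": (signals[0], signals[-1]),
        # Drop roughly 5% of the samples from each tail.
        "central_90pct": (signals[int(0.05 * n)],
                          signals[min(n - 1, int(0.95 * n))]),
        "failure_rate": 1 - len(successes) / n,
        "weakest_success": min(successes) if successes else None,
    }


# Invented illustration data: (signal in dBm, did ping4 succeed?)
demo = [(-71, True)] * 6 + [(-70, True)] * 2 + [(-74, True), (-78, False)]
for key, value in summarize(demo).items():
    print(key, value)
```

The weakest_success entry corresponds to the observation that an occasional ping gets through at a signal level where most of them fail.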
The new location atop the curio cabinet is a big improvement, and is going to be the final move. On a monitor run, Piki failed to ping4 Jacinth 11 times out of 37577, in 3 clusters (at 21:50, 09:52, 10:23) all having reduced signal strength.

Three days later:

Wi-Fi service was switched over to Beaver, including moving the Alfa AWUS036ACM to Beaver (and Terow AC1200 to Holly). Net interfaces:
- Beaver br0: 192.9.200.198 + c6 (beaver), contains rad1, en0
- Beaver Geneve: off (would be gen1 in br0)
- Beaver Geneve AP: rad9, solo, onboard, 192.9.200.201 + c9, beaverwf; it would be the Geneve AP but this is not active.
- Beaver AP: rad1 (Alfa AWUS036ACM) in br0, MTU 1500
- Beaver uplink: en0 in br0, BOGUSLY has an IP: 192.9.200.198 + c6 (beaver, same as br0)
- Holly br0: 192.9.200.199 + c7, contains rad0 rad9 gen1 en0
- Holly Geneve: gen1, in br0, MTU 1500
- Holly Geneve AP: rad9, BOGUSLY in br0, MTU 1500, 192.9.200.205 + cd (hollywf)
- Holly AP: rad0 (Terow) in br0, MTU 1500
- Holly uplink: en0 in br0, MTU 1500,
- Piki: powered off, but if it were on it would have:
- Piki br0: 192.9.200.207 + cf (piki), containing gen0 rad7 and sometimes en0
- Piki Geneve: gen0 in br0
- Piki AP: rad7 (Za-pai) in br0, MTU 1500
Testing: when Selen or Xena connect to Beaver or Holly, they can perform videos perfectly (Katamari Star, Wave Surfers, Big Buck Bunny). General use is normal for either. Except…
At irregular intervals, Beaver gets scrambled. I can't be sure if Holly behaves the same. Tested by: Xena connected (Wi-Fi) to Beaver (mostly) or Holly (sometimes). Xena pings beaver + beaverwf + holly. Packet loss rate to all three is similar. Most of the time the packet loss rate is 0.1% or less. It increases variably to 50%-90% lost, lasting 1 to 2 minutes. This didn't happen during testing with videos. (Not saying that videos immunized it; rather, the packetvore did not attack during testing.) Usually 15-30 min passes between attacks, but sometimes as low as 5 min.
Next step is to fix bogosities and retest: Beaver en0 IP, and Holly rad9 Geneve uplink being in br0.
- Beaver en0: As of 21:00 it has no IP, i.e. self-healed. Its start mode is auto; bootproto is static; IP addrs are explicitly 0.0.0.0/32 and ::/128. So how did it get an IP earlier?
- Holly rad9: /etc/hostapd/802.11-Geneve.conf loads generic.incl which puts all APs in the bridge, which is bogus in this case. Same on Beaver. I'm moving bridge=br0 from generic.conf to 80211n-CN.conf and 80211n-Holly.conf, and omitting completely in 802.11-Geneve.conf, and similarly on Beaver (for which Geneve is currently turned off).
- Changeover at 21:15; how long will it do OK? Switching Selen to Holly. Playing video (Katamari Star, Wave Surfers, Big Buck Bunny). Perfect, only one ping packet lost. Switching over to Beaver. Ooo, packetvore strikes exactly when Wave Surfers starts, for 16 sec, but no problem after that. Repeat test, 1 packet lost. Similar packetvore attacks were seen after test was over.
Starting up Piki. (About 21:31:00) It did not come up. Moving it so I can hook up en0. rad9 ESSID is CouchNet-Geneve, which is sending beacons (per WiFi Analyzer on Selen), but rad9 doesn't associate. Holly sees it associate and immediately dissociate, every 30 sec. Piki sees ringc2 do a complete association on its AP, then it dissociates after 15 sec, and retries after 30 sec. I'm assuming that ringc2 tried and failed to get a DHCP address because Piki's uplink was hosed. Similar behavior has been seen on Selen in the past but it actually says it couldn't get an address.
How I'm leaving it: Piki powered off. Beaver: hostapd.J running, geneve.J disabled. Holly: both hostapd.J and geneve.J disabled. This at 22:33. At 22:48 there had been 2 ping packets lost; no packetvore. The packetvore showed up at 22:51; this time it is a total outage for 3 min (a record). Lasted to 22:59, then came back without any interventions.
Now switching back to only Holly at 23:09. So far so good.

Collecting info and planning how to gather more info about what's wrong.

Overnight, Piki was powered off, Beaver was awake but hostapd.J and geneve.J were disabled+stopped. Holly hostapd.J was running, SSID was CouchNet, but no geneve.J . Bearer NIC (rad9) had no carrier because the Geneve AP on this NIC was disabled. Holly has the Terow NIC. No (known) error in Wi-Fi service.
Next plan: a 4-way contest: Beaver vs. Holly and Alfa AWUS036ACM vs Terow AC1200. With sketchy records, I think all four are going to avoid the packetvore. But I need to re-test with definite records. I'll have Piki's Wi-Fi uplink (normally bearing Geneve packets) connect to CouchNet and monitor when it goes down. Per past experience there should be plenty of dropouts in a 2-3 hour test. en0 will be connected, for access to Piki when Wi-Fi goes out.
Xena can now ping pikiwf (and pikien), IPv4+6. Packet loss is moderately good.
The tester is called ~jimc/bin/net-assess-wifi , currently on Piki. Design outline:
- ip -4 route show | grep rad ; identify the device, in $IFC
- TGT=jacinth (what we're going to ping) (use static IP)
- Every 30 secs do: ping -c 10 -i 0.2 -I $IFC -n -q $TGT
- Relevant output: 10 packets transmitted, 10 received, 0% packet loss… Extract the number received.
- Whenever the number received is under a limit, report it.
- Also report the signal level. iwconfig $IFC and look for e.g. Signal level=-57 dBm. See Piki's signal.sh .
- Idea: I should also report the number of packets in the last 30 sec, to detect a packet storm.
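The design outline can be sketched in Python as follows. This is not the actual net-assess-wifi script: the function names, the default limit of 10, and the return format are all assumptions made for illustration. The ping flags and the Signal level pattern are the ones given in the outline, and the packet counter is read from the kernel's per-interface statistics file in sysfs.

```python
import re
import subprocess
import time

PING_RE = re.compile(r"(\d+) packets transmitted, (\d+) received")
SIGNAL_RE = re.compile(r"Signal level=(-?\d+) dBm")


def parse_ping_received(output):
    """Number of replies in ping's summary line; 0 if no summary is found."""
    match = PING_RE.search(output)
    return int(match.group(2)) if match else 0


def parse_signal_level(output):
    """Signal level in dBm from iwconfig output, or None if not reported."""
    match = SIGNAL_RE.search(output)
    return int(match.group(1)) if match else None


def read_rx_packets(ifc, sysfs="/sys/class/net"):
    """Cumulative received-packet counter, for spotting a packet storm."""
    with open(f"{sysfs}/{ifc}/statistics/rx_packets") as stats:
        return int(stats.read())


def assess_once(ifc, tgt, limit=10):
    """Ping tgt 10 times via ifc; return (received, signal) if under limit."""
    ping = subprocess.run(
        ["ping", "-c", "10", "-i", "0.2", "-I", ifc, "-n", "-q", tgt],
        capture_output=True, text=True)
    received = parse_ping_received(ping.stdout)
    if received >= limit:
        return None                      # nothing to report
    iwconfig = subprocess.run(["iwconfig", ifc],
                              capture_output=True, text=True)
    return received, parse_signal_level(iwconfig.stdout)


def monitor(ifc, tgt, interval=30):
    """Run assess_once every `interval` seconds and print each fault."""
    last_rx = read_rx_packets(ifc)
    while True:
        fault = assess_once(ifc, tgt)
        rx = read_rx_packets(ifc)
        if fault is not None:
            stamp = time.strftime("%m-%d %H:%M:%S")
            print(f"{stamp} {fault[0]}/10 received, signal {fault[1]} dBm, "
                  f"{rx - last_rx} packets since last check")
        last_rx = rx
        time.sleep(interval)
```

To use it, something like monitor("rad9", "192.9.200.193") would be called with the interface found by the route lookup and the target's static IP; the address shown is only a placeholder for whatever static IP is chosen for $TGT.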
Outcomes: In all cases the listed combination was the only access point active, and no Geneve. On both hosts, rad9 was up but not associated with anything (normally has the Geneve AP, now disabled).
- Holly + Terow AC1200: Start 18:00, 137 min, no packetvore.
  Start 12-05 21:45. About 3 min after starting, the system went into a massive packetvore attack, with loss rates almost always over 50% and 90% at least half the time. Did not revert until a reboot. Wi-Fi Analyzer on Selen reports CouchNet beacons at -45 dBm, and none of my other APs were transmitting. Neighbors were all worse than -75dBm. So it's (probably) not caused by neighbors. Still going on at 22:07; I rebooted Holly. After reboot it's back to normal. For about 1 min, then choppy packet loss, lasting 2 min, then returned to zero packet loss. Next at 22:14:20 I stopped hostapd.J, un+replugged the Terow, and restarted hostapd.J. Almost zero packet loss. Going to bed.
  Start 12-07 23:30 after giving up on Beaver + Alfa AWUS036ACM. This time it ran for 39.5 hours with 4 incidents of 8/10 pings received and 1 of 9/10 pings, until the test was terminated. Is Holly+Terow perfect? Next test will be Holly+Alfa.
- Beaver + Alfa AWUS036ACM: 62 min, no reports produced, but there was about a 5 min interval of choppy packet loss, likely not reaching 40%, and Alice complained of a Wi-Fi problem. For overnight, switched back to Holly + Terow.
  Giving Beaver + Alfa another chance. With an improved tester. Starting at 12-06 17:00, summary at 20:45, 2 incidents with more than 2 of 10 ping packets lost. The number of packets received per 30sec was not significantly different from the surrounding error-free tests, indicating no packet storm. In the one that I saw on another monitor, the packet loss rate was elevated for 15-30sec and then self-healed. I'm leaving this to run overnight. Tomorrow I'll try swapping the NICs. Oops, at 20:55:30 it went into high loss rate mode including several stretches of 100% loss. Stopping and starting hostapd.J brought it back to normal. However, 2 min later it had a 90 sec stretch of irregular high packet loss. More of the same at 21:39. This isn't going to fly overnight, reverting to Holly + Terow.
- Holly + Alfa AWUS036ACM: Started 12-09 14:00. At 16:00, perfect performance. At 17:00, one 8/10 ping fault. Continuous monitoring shows very few packets lost. Stopped test at 18:00, switching to Beaver+Terow.
  Started again 12-09 22:20 to test Alfa overnight. As of 12-10 15:00 it had run for 16hr 20min and had one 8/10 ping fault.
- Beaver + Terow AC1200: Started 12-09 18:06. Until 21:15 (3+hr) there was one ping fault, 7/10 received, and another caused by switching over to Beaver (doesn't count). Continuous monitoring shows very few packets lost. 22:11:30 and 22:12:00 had 2 separate major glitches, 1/10 and 0/10. Switching to Holly+Alfa for overnight.
Final conclusion: The packetvore lives in Geneva. If I can get wired Ethernet to Piki, and if I forget about a range extender using Geneve, I can get Wi-Fi working right and can terminate this time sink.
