cproudlock 18537acbbc PXE server: fix WinPE re-image SMB connection loss
WinPE clients re-imaging the same machine hit "System error 53 -
network path not found" on the second attempt. systemctl restart smbd
did not help; only a full server power cycle cleared the state.

Root cause is kernel nf_conntrack: the default TCP ESTABLISHED timeout
is 5 days (432000s), so when a client reboots abnormally during the
first WinPE run, its session leaves behind an ASSURED ESTABLISHED
entry. ufw's state-tracking rules then mis-classify the new SYN
against that stale entry.
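
The 5-day figure can be checked on the server itself. A minimal sketch,
assuming the nf_conntrack module is loaded so that its procfs entry
exists; fmt_timeout is a hypothetical helper and is not part of this
commit:

```shell
#!/bin/bash
# Hypothetical helper (not in the commit): render a timeout given in
# seconds as whole days plus remaining whole hours.
fmt_timeout() {
    local secs=$1
    printf '%ss = %dd %dh\n' "$secs" $((secs / 86400)) $(((secs % 86400) / 3600))
}

# Standard procfs location of the sysctl named in the commit message.
# The file only exists while nf_conntrack is loaded.
f=/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
if [ -r "$f" ]; then
    fmt_timeout "$(cat "$f")"
else
    echo "nf_conntrack not loaded (no $f)"
fi
```

With the conntrack tool installed, `conntrack -L -p tcp --dport 445`
should list the tracked SMB flows, including any stale ESTABLISHED entry
left by a client that rebooted.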

Fix applied in three layers:
- /etc/sysctl.d/99-pxe-conntrack.conf drops TCP ESTABLISHED timeout
  to 1 hour and shortens the half-closed states to 30s each.
- smb.conf gains socket options TCP_NODELAY SO_KEEPALIVE IPTOS_LOWDELAY
  plus keepalive = 30 (seconds) and deadtime = 5 (minutes). Active
  sessions refresh the conntrack timer every 30s via keepalives so
  they never age out; dead ones expire from conntrack within an hour.
- /usr/local/sbin/smb-diag.sh snapshots kernel + Samba state for
  remote diagnosis; /usr/local/sbin/smb-soft-reset.sh walks a
  progressive recovery (nmbd/smbd restart, conntrack flush, arp
  flush, ss -K) as an alternative to power-cycling.
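
The first two layers can be sketched as config fragments. This is an
illustration under stated assumptions, not the files shipped by the
commit: the 3600s and 30s values and the smb.conf settings come from the
text above, but which sysctl keys count as "half-closed states" is a
guess (fin_wait, close_wait, last_ack, time_wait), and the emit_*
function names are hypothetical. The script only prints to stdout and
writes nothing under /etc.

```shell
#!/bin/bash
# Sketch only: print the two config fragments for review.

# Conntrack drop-in. The state list in the loop is an ASSUMPTION; the
# commit message does not name the half-closed states it shortens.
emit_sysctl() {
    echo "net.netfilter.nf_conntrack_tcp_timeout_established = 3600"
    local state
    for state in fin_wait close_wait last_ack time_wait; do
        echo "net.netfilter.nf_conntrack_tcp_timeout_${state} = 30"
    done
}

# smb.conf [global] additions, copied from the commit message.
# keepalive is in seconds, deadtime is in minutes.
emit_smb_global() {
    cat <<'EOF'
[global]
   socket options = TCP_NODELAY SO_KEEPALIVE IPTOS_LOWDELAY
   keepalive = 30
   deadtime = 5
EOF
}

emit_sysctl
echo
emit_smb_global
```

After placing the real files, `sysctl --system` reloads the sysctl.d
drop-ins and `testparm -s` validates smb.conf before smbd is restarted.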

conntrack package added to download-packages.sh and playbook verify
list so the offline .deb bundle ships with it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 13:00:43 -04:00


#!/bin/bash
# smb-diag.sh - snapshot Samba + kernel network state so a future failure
# can be diagnosed remotely. Run this on the PXE server BEFORE power-cycling
# when a WinPE re-image client is getting "cannot connect" errors.
#
# Output: /tmp/smb-diag-<timestamp>.log (pastebin-friendly)
#
# Captures: smbd processes, open SMB sessions, port 445 TCP sockets,
# conntrack, arp, bridge fdb, dnsmasq leases, recent smbd logs.
set -o pipefail
TS=$(date +%Y%m%d-%H%M%S)
OUT=/tmp/smb-diag-$TS.log
exec > >(tee "$OUT") 2>&1
echo "=============================================================="
echo "SMB diagnostic snapshot - $(date)"
echo "=============================================================="
echo
echo "### uptime / kernel ###"
uptime
uname -r
echo
echo "### interfaces + bridge state ###"
ip -brief addr
echo
bridge link show 2>/dev/null
echo
bridge fdb show 2>/dev/null | head -30
echo
echo "### smbd process tree ###"
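# MainPID is 0 (or empty) when smbd is not running; skip pstree in that
# case instead of passing it a bogus or missing PID.
SMBD_PID=$(systemctl show -p MainPID --value smbd 2>/dev/null)
if [ -n "$SMBD_PID" ] && [ "$SMBD_PID" != "0" ]; then
    pstree -p "$SMBD_PID" 2>/dev/null
else
    echo "smbd has no MainPID (not running?)"
fi
echo
# The [sn] bracket keeps grep from matching its own command line.
ps -eo pid,ppid,state,command | grep -E '[sn]mbd'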
echo
echo "### systemctl status ###"
systemctl is-active smbd nmbd dnsmasq apache2
echo
echo "### smbstatus ###"
smbstatus 2>&1 | head -40
echo
echo "### port 445 sockets ###"
ss -tnp 2>/dev/null | grep :445
echo
echo "### conntrack entries for PXE subnet ###"
if command -v conntrack >/dev/null 2>&1; then
    conntrack -L 2>&1 | grep -E '10\.9\.100' | head -30
    echo "total conntrack entries: $(conntrack -C 2>&1)"
else
    echo "conntrack tool not installed"
fi
echo
echo "### arp / neighbour table for PXE subnet ###"
ip neigh show 2>/dev/null | grep -E '10\.9\.100|br-pxe'
echo
echo "### dnsmasq DHCP leases ###"
head -20 /var/lib/misc/dnsmasq.leases 2>/dev/null
echo
echo "### recent smbd log files ###"
ls -la /var/log/samba/ 2>/dev/null | head -20
echo
echo "### recent smbd auth / status errors (all machine logs) ###"
grep -hE 'NT_STATUS|error|denied' /var/log/samba/log.*.log 2>/dev/null | tail -30
echo
echo "### last 20 lines of smbd master log ###"
tail -20 /var/log/samba/log.smbd 2>/dev/null
echo
echo "=============================================================="
echo "Snapshot saved to $OUT"
echo "=============================================================="