Troubleshooting

A symptom-keyed index for things that surface during Periscope installs, upgrades, and day-2 operation. Most feature-specific issues are documented inline with the feature — this page indexes them so an operator can grep by what they saw, plus adds cross-cutting items (chart-versions OOM, network changes, scanner false-positives) that don't fit a single feature doc.

If you don't find your symptom here, the most diagnostic single file is the Periscope pod's slog stream — every privileged action gets a structured line, and every request carries an X-Request-Id that grep-joins to the matching audit row.


Quick lookup

| Symptom | Where |
|---|---|
| "Open Shell" button missing on a pod page | pod-exec.md §8 |
| Shell button missing on a cluster page header | cluster-shell.md §9 |
| WebSocket upgrade fails / instant disconnect | pod-exec.md |
| E_FORBIDDEN on cluster-shell click | cluster-shell.md §9 |
| Shell session pod stays Pending >30s | cluster-shell.md §9; also see below |
| Helm command not in cluster_shell_close.commands[] | Upgraded? On v1.1.5+ helm is audited; pre-v1.1.5 it isn't (CHANGELOG) |
| Node shell button missing on a node page | node-shell-ssm.md §7 |
| Node shell AccessDenied on StartSession (document/...) | node-shell-ssm.md §7; split the policy into two StartSession statements |
| Node shell preflight: node not Online / E_NODE_NOT_EC2 | node-shell-ssm.md §7 |
| Node shell AccessDenied on AssumeRole though aud looks right | node-shell-ssm.md §7; trust policy can't gate on groups |
| E_REAUTH_REQUIRED opening a node shell | node-shell-ssm.md §7; id_token stale, sign in again |
| Tier user gets 403 on everything | cluster-rbac.md |
| groups claim missing / empty | okta.md / auth0.md |
| Login bounces to IdP forever | okta.md / auth0.md |
| Refresh fails after N hours | okta.md / auth0.md |
| Callback URL mismatch / Sign-in redirect URI mismatch | okta.md / auth0.md |
| Audit page shows nothing / auditEnabled: false | audit.md |
| X-Audit-Scope: self for an admin user | audit.md |
| Audit DB grows past maxSizeMB | audit.md |
| SSE drops every ~60 seconds | watch-streams.md §7 |
| Watch stream returns 404 immediately | watch-streams.md §7 |
| Watch stream cap reached | watch-streams.md §7 |
| Page updates feel polled, not live | watch-streams.md §7 |
| EKS "Drift not computed" | eks-upgrade-readiness.md |
| EKS Upgrade Insights "load failed" | eks-upgrade-readiness.md |
| Agent registration tls: certificate required | agent-onboarding.md |
| Agent registration SPKI pin mismatch | agent-onboarding.md |
| Agent registration registration rejected (HTTP 401) | agent-onboarding.md |
| Agent connects, then drops every few seconds | agent-onboarding.md |
| Cluster card stuck "unreachable" | agent-onboarding.md |
| Periscope pod OOMKilled when listing chart versions | below |
| Artifact Hub flags HIGH/MEDIUM CVE on a clean image | below |
| Agent tunnel reconnect churn after the central LB changes IP | below |
| Local microk8s: TLS cert error after switching networks | below |
| Agent can't connect after a full reinstall (connection refused) | below |

Production gotchas

Periscope pod OOMKilled when listing helm chart versions

Symptom: SPA fetches /api/clusters/{c}/helm/chart/versions against a public helm repo with a large index.yaml (prometheus-community, bitnami, ingress-nginx, etc.) and the Periscope pod restarts. kubectl describe shows Last State: Terminated, Reason: OOMKilled, Exit Code: 137.

Cause: the handler reads the entire index.yaml into memory and yaml.Unmarshals it in one shot. For prometheus-community (5–10 MB index.yaml), the burst hits ~600–700 MiB RSS — well past the chart default resources.limits.memory: 512Mi. The pod is OOM-killed mid-request; ingress-nginx surfaces this as a 503.

Fix: bump the memory limit while a streaming-parse follow-up lands. 1Gi is comfortable headroom for every public helm repo we've benchmarked:

# values.yaml
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 1Gi

Apply with helm upgrade periscope ... --reuse-values --set resources.limits.memory=1Gi.

Notes:

  • The 5-minute chartVersionsCache only helps the second request — first fetch after pod start still pays the full burst.
  • Affects all backends equally (in-cluster, agent, eks, kubeconfig).
  • Steady-state Periscope RSS is ~200 MiB; the bump is for the chart-versions burst alone.
  • The streaming-parse fix tracks under the helm-chart-versions endpoint refactor — once it lands, the 1Gi bump can revert.

Cluster-shell pod stuck Pending behind an outbound proxy

Symptom: The operator clicks the shell button. The drawer opens and the WebSocket connects, but the terminal sits at "starting…" for more than 30 seconds. kubectl -n periscope-system describe pod periscope-shell-* shows Failed: ErrImagePull / ImagePullBackOff against ghcr.io/gnana997/periscope-shell:<version>.

Cause: the cluster's container runtime can't reach ghcr.io — typically because the cluster is air-gapped, the egress runs through a corporate proxy that doesn't allow ghcr.io, or the proxy needs auth not configured on the kubelet.

Fix: mirror the shell image into the registry the cluster can reach, then point the chart at it:

# Server chart (values.yaml)
clusterShell:
  enabled: true
  image:
    repository: registry.internal.acme.com/periscope/periscope-shell
    tag: v1.1.5  # match your installed periscope version
    pullPolicy: IfNotPresent

# Agent chart (per managed cluster) — only needed if you use the
# agent backend and the kubelet on managed clusters also can't reach
# ghcr.io. Same value, applied separately because the agent and
# server can be sized for different network postures.

The shell image is opt-in (default clusterShell.enabled: false), so operators who don't use cluster shell never pull it. Once mirrored, operators on locked-down clusters get the same in-browser experience without punching a hole through their egress policy.

Notes:

  • The clusterShell.podStartTimeoutSeconds (default 30s) determines how long Periscope waits before returning E_SHELL_POD_TIMEOUT — bump it if your mirror is slow on first pull, but the underlying issue is registry reachability, not timeout tuning.
  • The shell image is multi-arch (linux/amd64 + linux/arm64), so a one-time crane copy or equivalent into your mirror works for both.
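The one-time mirror step could be wrapped roughly as follows. This is a minimal sketch, assuming crane is installed and already authenticated against both registries; the function name mirror_shell_image and the CRANE override are hypothetical helpers, not part of Periscope or the chart.

```shell
# Hypothetical helper: copy the cluster-shell image into an internal mirror.
# Assumes crane is on PATH and logged in to both registries.
mirror_shell_image() {
  version="$1"      # image tag, e.g. v1.1.5 (match your installed Periscope version)
  mirror_repo="$2"  # e.g. registry.internal.acme.com/periscope/periscope-shell
  src="ghcr.io/gnana997/periscope-shell:${version}"
  dst="${mirror_repo}:${version}"
  # CRANE defaults to the real binary; set CRANE=echo to print the command instead.
  ${CRANE:-crane} copy "$src" "$dst"
}
```

A dry run such as CRANE=echo mirror_shell_image v1.1.5 registry.internal.acme.com/periscope/periscope-shell prints the copy command without touching either registry. The destination repository should match clusterShell.image.repository in your values.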

Agent tunnel reconnect churn after the central LB rotates IP

Symptom: The agent log on a managed cluster shows a steady cadence of:

agent.dial_attempt   server=wss://...:8443/api/agents/connect
agent.dial_failed    err=...
agent.reconnect_in   seconds=8

The central server's cluster registry shows the cluster as "unreachable" or flickering between connected/unreachable.

Cause: the agent's serverURL was set at install time. If the LB / Ingress IP behind that URL changes (for example, an AWS NLB rotates a node IP, the DNS record behind a hostname-based serverURL starts resolving to a new address once the cached TTL expires, or the central Periscope LB was reprovisioned), there is a brief window where the agent's cached DNS answer or TCP connection still points at the old address.

Why it usually self-heals: the agent's reconnect supervisor (internal/tunnel) backs off [0, 1s, 3s, 8s] with a max 30s ceiling and re-dials from scratch each cycle. DNS re-resolves on each dial. A 1-2 minute window of churn is the typical signature of an LB rotation; longer means something else broke.

When to act:

  • Hostname-based serverURL: usually nothing to do; DNS TTL expires and reconnect succeeds. If your DNS TTL is unusually long (>5 min), consider lowering it on the Periscope server's record to shorten the churn window.
  • IP-based serverURL (rare in production; typically dev rigs or temporary topologies): you'll need to helm upgrade the agent with the new IP. After the upgrade, the agent reads the new value from the rendered env and reconnects on the next dial.

Diagnostic: turn on agent debug logging to see the per-attempt detail (DNS resolution, TCP error, TLS handshake) — see agent-onboarding.md → Turn on debug logging on the agent.
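Before turning on debug logging, a quick count of dial attempts versus failures in a saved agent log can show whether the churn is still ongoing. The sketch below assumes the event names from the symptom above (agent.dial_attempt, agent.dial_failed) appear verbatim in each log line; the function name churn_summary and the log file name are hypothetical.

```shell
# Hypothetical helper: summarize reconnect churn from a saved agent log file.
# Assumes the event names agent.dial_attempt / agent.dial_failed appear as-is.
churn_summary() {
  log="$1"
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true" keeps
  # the helper usable under "set -e".
  attempts=$(grep -c 'agent\.dial_attempt' "$log" || true)
  failures=$(grep -c 'agent\.dial_failed' "$log" || true)
  echo "attempts=${attempts} failures=${failures}"
}

# Example (the log file name is an assumption; save the agent's stdout first):
#   churn_summary agent.log
```

If failures keep growing across two runs a few minutes apart, the churn has outlasted the usual 1-2 minute LB-rotation window and is worth investigating with debug logging.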


Local-development gotchas

Agent can't reach the central server after a full reinstall on local NodePort

Symptom: After uninstalling the central Periscope helm release and reinstalling it (e.g. testing a fresh-install flow on local microk8s or kind), the agent on a managed cluster crashloops with:

"err":"state: registration: POST register:
  Post \"http://192.168.0.6:31429/api/agents/register\":
  dial tcp 192.168.0.6:31429: connect: connection refused"

The IP looks right but the port doesn't answer.

Cause: when the central Periscope service is exposed as service.type: NodePort (typical for local-dev rigs), Kubernetes allocates a new random NodePort on every fresh install. The agent's saved serverURL / registrationURL point at the old NodePorts, which the new service doesn't bind. TCP gets refused.

In-place helm upgrade --reuse-values preserves NodePorts because the Service object isn't recreated. Only full uninstall → install triggers re-allocation.

Fix:

# 1. Look up the new NodePorts
microk8s kubectl -n periscope get svc periscope -o yaml \
  | grep -A2 'nodePort:'
# Example output:
#     nodePort: 30733   (port 8080 — HTTP / registration)
#     nodePort: 32167   (port 8443 — tunnel)

# 2. Update the agent values
sed -i \
  -e 's|registrationURL: http://.*|registrationURL: http://<host>:30733|' \
  -e 's|serverURL: wss://.*|serverURL: wss://<host>:32167|' \
  kind-agent-values.yaml

# 3. Helm upgrade the agent — agent reconnects on the new URL
helm upgrade periscope-agent \
  oci://ghcr.io/gnana997/charts/periscope-agent --version 1.1.5 \
  -n periscope -f kind-agent-values.yaml

If the agent had not yet completed registration (TCP refused at the register step), the bootstrap token is still unused — you can keep it. If registration succeeded against the old install before the reinstall, the new central server doesn't have the agent's cert in its registry, so mint a fresh token from POST /api/agents/tokens and re-register.

To avoid: for local-dev rigs that you might re-install repeatedly, pin the NodePort by patching the Service after install:

microk8s kubectl -n periscope patch svc periscope --type=json -p='[
  {"op":"replace","path":"/spec/ports/0/nodePort","value":31429},
  {"op":"replace","path":"/spec/ports/1/nodePort","value":31410}
]'

The chart doesn't currently expose service.nodePorts.* overrides (adding them to the values is on the backlog). For production, use a LoadBalancer or Ingress; NodePorts would then be cluster-internal only and not referenced from the agent values, so this issue doesn't apply.


microk8s apiserver TLS cert mismatch after switching networks

Symptom: After moving your laptop between networks (wifi → hotspot, office → home, etc.), kubectl port-forward against a local microk8s cluster fails with:

error: error upgrading connection: error dialing backend: tls: failed
to verify certificate: x509: certificate is valid for 192.168.0.6,
172.17.0.1, 172.18.0.1, ..., not 172.20.10.3

The new IP (172.20.10.3 here) is wherever your laptop landed after the network swap.

Cause: this is a snap-microk8s + mobile-laptop interaction, not a Periscope issue. Three ingredients combine to produce it:

  1. microk8s's apiserver binds to 0.0.0.0 (every interface, including ones that come and go with your network).
  2. The apiserver TLS cert was minted at install time with the host's IPs at that moment — your wifi IP became a SAN, but your hotspot IP didn't exist yet.
  3. microk8s runs an interface-watcher daemon that rewrites your kubeconfig's server: line when it sees a new primary IP. So kubectl now dials the new IP, the apiserver answers, and Go's x509 verifier rejects the cert.

Production clusters don't combine these three. Cloud-managed K8s (EKS, GKE, AKS) reaches the apiserver via a stable DNS name, and the cert covers that name. kind binds to localhost only and never touches your host IPs. k3s defaults to a localhost-only kubeconfig. Only snap-microk8s combined with a moving laptop trips all three.

Fix (non-destructive, ~30 seconds of apiserver downtime, no workload disruption):

sudo microk8s refresh-certs -e server.crt

This regenerates the apiserver cert with current host IPs in the SAN list. Pods keep running through the brief apiserver restart; kubelets reconnect; in-flight kubectl calls retry transparently.
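To confirm the diagnosis before the refresh, or to verify the result afterwards, you can check whether a given IP is present in the cert's SAN list. This is a minimal sketch that parses the text output of openssl x509; the function name san_has_ip is hypothetical, and the cert path in the usage comment is an assumption based on the certs directory referenced elsewhere in this entry.

```shell
# Hypothetical helper: read "openssl x509 -noout -text" output on stdin and
# succeed only if the given IP is listed as an exact SAN entry.
san_has_ip() {
  ip="$1"
  grep -A1 'Subject Alternative Name' \
    | tail -n 1 \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//' \
    | grep -q -x -F "IP Address:${ip}"
}

# Example (cert path is an assumption; adjust to your install):
#   sudo openssl x509 -in /var/snap/microk8s/current/certs/server.crt -noout -text \
#     | san_has_ip 172.20.10.3 && echo "IP is covered" || echo "IP missing from SANs"
```

Matching whole entries rather than substrings avoids a false positive where, say, 192.168.0.6 would otherwise match 192.168.0.60.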

Durable fix (one-time setup, ~10 minutes): pin the apiserver to a stable hostname your network state can't change:

  1. Add a hosts entry: echo "127.0.0.1 periscope-dev.local" | sudo tee -a /etc/hosts
  2. Edit /var/snap/microk8s/current/certs/csr.conf.template, add DNS.5 = periscope-dev.local
  3. sudo microk8s refresh-certs -e server.crt
  4. Update kubeconfig's server: line to https://periscope-dev.local:16443
  5. For Periscope-agent installs, set agent.serverURL=wss://periscope-dev.local:31410 (or whatever your tunnel port is)

After this, you can swap networks freely without breaking kubectl, helm, or the agent tunnel.

Note for cluster operators: the durable fix is what production clusters do anyway (stable hostname, cert pinned to the hostname). A laptop dev rig is the only place this gets tricky.

See also: local-dev.md if it exists, otherwise this entry is the canonical place.


Periscope-binary-side: making slog the diagnostic surface

Almost every entry on this page can be confirmed (or ruled out) by grepping the Periscope pod's slog stream for matching keys. The binary writes structured JSON; relevant fields per scenario:

| Scenario | What to grep |
|---|---|
| OOM during chart-versions | helm chart versions (handler name) + check pod RestartCount |
| Cluster-shell pod won't start | cluster_shell.pod_create, cluster_shell.pod_wait_ready |
| Node shell won't open | ssm_shell.* (handler), node_shell (startup config line); the STS/SSM AccessDenied is in the audit ssm_session_open failure + the client error frame's message |
| Agent tunnel issues | tunnel.*, agent.session_event, agent.dial_* |
| OIDC / auth issues | auth.*, look for groups field on the line |
| Watch streams | watch.stream_open, watch.stream_close |
| Audit pipeline | audit:, especially around startup |

All log lines carry an X-Request-Id that the audit pipeline also records on the corresponding audit row — one ID grepped across the two surfaces gives a complete end-to-end trace.

For agent-backed clusters, the same request_id appears in both the central server's stdout AND the agent's stdout (when agent.logLevel: debug), so an end-to-end trace across three logs is one grep of the same UUID.
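The grep-join described above can be sketched as a small helper that searches saved log files for one request ID. It matches the ID as a fixed string, so it does not depend on the exact JSON field name; the function name trace_request and the log file names are hypothetical.

```shell
# Hypothetical helper: print every log line that mentions a request ID,
# prefixed with the file it came from.
trace_request() {
  rid="$1"; shift
  # -F: treat the ID as a literal string; -H: always show the file name.
  grep -H -F -- "$rid" "$@"
}

# Example (file names are assumptions; save each pod's stdout first):
#   trace_request 3f1c2d7e-... server.log agent.log
```

Each output line is prefixed with its source file, so it is easy to see where the request entered and where it stopped.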


See also

| File | What's there |
|---|---|
| pod-exec.md §8 | Pod exec session troubleshooting (4 entries) |
| cluster-shell.md §9 | Cluster shell troubleshooting (8 entries) |
| agent-onboarding.md | Agent registration + tunnel troubleshooting |
| audit.md | Audit pipeline issues |
| watch-streams.md | SSE watch stream issues |
| eks-upgrade-readiness.md | EKS Upgrade Insights / AMI drift |
| cluster-rbac.md | Tier mode / RBAC binding mistakes |
| okta.md | Okta-specific OIDC mistakes |
| auth0.md | Auth0-specific OIDC mistakes |