Perform Access Gateway health check
A timely AGW health check confirms that the gateway is in a good operating state, with no failures or errors, and lets you resolve issues proactively. In addition to regular checks, run this procedure before and after any change to the node and compare the results, so that undesired side effects of the change are caught early.
Affected components: AGW, eNodeB, Orchestrator
Connectivity
- Log in to the AGW by running the command below in a terminal:
    ssh magma@<IP of AGW>
- Check the Magma interfaces. Make sure eth0 and eth1 are UP.
    ip addr
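The interface check above can be scripted by parsing the text that `ip addr` prints. The following is a minimal sketch, not part of Magma: the function names and the sample output are illustrative, and treating an interface as healthy only when both the `UP` and `LOWER_UP` flags are present is an assumption about what "UP" should mean here.

```python
import re

# Matches interface header lines of `ip addr`, for example:
# "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ... state UP ..."
_IFACE_RE = re.compile(r"^\d+:\s+([^:@\s]+)(?:@[^:]+)?:\s+<([^>]*)>")


def interface_flags(ip_addr_output):
    """Map each interface name to the set of flags listed between < and >."""
    flags = {}
    for line in ip_addr_output.splitlines():
        match = _IFACE_RE.match(line)
        if match:
            name, raw = match.groups()
            flags[name] = set(raw.split(",")) if raw else set()
    return flags


def interfaces_down(ip_addr_output, required=("eth0", "eth1")):
    """Return the required interfaces that are missing or lack UP/LOWER_UP."""
    flags = interface_flags(ip_addr_output)
    return [n for n in required if not {"UP", "LOWER_UP"} <= flags.get(n, set())]


# Illustrative output only (not captured from a real AGW): eth1 has no carrier.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
"""

print(interfaces_down(SAMPLE))
```

To use it on a gateway, the text could be captured with something like `subprocess.run(["ip", "addr"], capture_output=True, text=True).stdout` and passed to `interfaces_down`.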
eNodeB Connection
- Check that the S1 and SGi interfaces can ping the eNodeB(s) and the internet, respectively.
    ping google.com -I eth0
    ping <eNodeB IP> -I eth1
- For a managed eNB, check the status of the eNodeB(s) attached to the gateway using the CLI (skip this step for an unmanaged eNB):
    sudo enodebd_cli.py get_all_status
  An eNodeB in a good state looks similar to the following:
magma@magma:~$ enodebd_cli.py get_all_status
--- eNodeB Serial: 120200004917CNJ0028 ---
IP Address..................10.0.2.243
eNodeB connected....................ON
eNodeB Configured...................ON
Opstate Enabled.....................ON
RF TX on............................ON
RF TX desired.......................ON
GPS Connected.......................ON
PTP Connected......................OFF
MME Connected.......................ON
GPS Longitude..............-106.347936
GPS Latitude.................35.608135
FSM State...............Completed provisioning eNB. Awaiting new Inform.
- Check the eNodeB at the SCTP level by taking a packet capture. There should be heartbeat messages between the eNB and the AGW IP.
    sudo tcpdump -i any sctp
  An SCTP association in a good state looks similar to the following:
magma@magma:~$ sudo tcpdump -i any sctp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
06:59:06.045369 IP 10.0.2.243.36412 > 10.0.2.242.36412: sctp (1) [HB REQ]
06:59:06.045521 IP 10.0.2.242.36412 > 10.0.2.243.36412: sctp (1) [HB ACK]
06:59:07.534188 IP 10.0.2.242.36412 > 10.0.2.243.36412: sctp (1) [HB REQ]
06:59:07.544183 IP 10.0.2.243.36412 > 10.0.2.242.36412: sctp (1) [HB ACK]
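Both outputs in this section lend themselves to a scripted check. The sketch below is illustrative and not part of Magma: the function names are hypothetical, the parsers assume the exact text layouts shown in the samples above, and `PTP Connected` is ignored by default only because the good-state sample shows it as OFF. The `STATUS` sample is a shortened copy of the output above with `RF TX on` changed to OFF to demonstrate a finding.

```python
import re


def enodeb_fields_not_on(status_output, ignore=("PTP Connected",)):
    """Return {serial: [field, ...]} listing ON/OFF fields that report OFF."""
    problems = {}
    serial = None
    for raw in status_output.splitlines():
        line = raw.strip()
        header = re.match(r"^--- eNodeB Serial: (\S+) ---$", line)
        if header:
            serial = header.group(1)
            problems[serial] = []
            continue
        # Fields look like "RF TX on............................ON".
        field = re.match(r"^(.+?)\.{2,}(ON|OFF)$", line)
        if field and serial is not None:
            name, value = field.groups()
            if value == "OFF" and name not in ignore:
                problems[serial].append(name)
    return problems


_HB_RE = re.compile(r"IP (\S+) > (\S+): sctp \(\d+\) \[(HB REQ|HB ACK)\]")


def unanswered_heartbeats(tcpdump_output):
    """Return (src, dst) pairs with more HB REQ sent than HB ACK received."""
    pending = {}
    for line in tcpdump_output.splitlines():
        match = _HB_RE.search(line)
        if not match:
            continue
        src, dst, kind = match.groups()
        if kind == "HB REQ":
            pending[(src, dst)] = pending.get((src, dst), 0) + 1
        else:
            # An ACK travels in the opposite direction of the request it answers.
            key = (dst, src)
            if pending.get(key, 0) > 0:
                pending[key] -= 1
    return [pair for pair, count in pending.items() if count > 0]


# Shortened from the sample above; "RF TX on" is flipped to OFF for illustration.
STATUS = """\
--- eNodeB Serial: 120200004917CNJ0028 ---
IP Address..................10.0.2.243
eNodeB connected....................ON
RF TX on...........................OFF
PTP Connected......................OFF
MME Connected.......................ON
"""

# Copied from the capture above: every request is acknowledged.
CAPTURE = """\
06:59:06.045369 IP 10.0.2.243.36412 > 10.0.2.242.36412: sctp (1) [HB REQ]
06:59:06.045521 IP 10.0.2.242.36412 > 10.0.2.243.36412: sctp (1) [HB ACK]
06:59:07.534188 IP 10.0.2.242.36412 > 10.0.2.243.36412: sctp (1) [HB REQ]
06:59:07.544183 IP 10.0.2.243.36412 > 10.0.2.242.36412: sctp (1) [HB ACK]
"""

print(enodeb_fields_not_on(STATUS))
print(unanswered_heartbeats(CAPTURE))
```

A request captured just before the capture stops will be reported as unanswered, so a single pending heartbeat at the end of a short capture is not necessarily a fault.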
Magma Services
- Check all gateway services and their status by running the commands below.
    sudo service magma@* status
    sudo service sctpd status
    service openvswitch-switch status
- Make sure that all services are in the “active (running)” state and that no service reports errors.
- Check that the “active (running)” duration matches how long the AGW has been up; a much shorter duration indicates an unexpected service restart. In the example below, the service has been running for 3 min and 35s.
- Verify that memory does not reach the limit assigned to the service. In the example below, memory used is 113.9M out of 512M.
magma@mme.service - Magma OAI MME service
Loaded: loaded (/etc/systemd/system/magma@mme.service; disabled; vendor preset: enabled)
Active: active (running) since Tue 2021-09-07 13:18:51 UTC; 3 min and 35s ago
Process: 7732 ExecStartPre=/usr/bin/env python3 /usr/local/bin/config_stateless_agw.py reset_sctpd_for_stateful (code=exited, status=0/SUCCESS)
Process: 7617 ExecStartPre=/usr/bin/env python3 /usr/local/bin/generate_oai_config.py (code=exited, status=0/SUCCESS)
Main PID: 7854 (mme)
Tasks: 28 (limit: 4915)
Memory: 113.9M (limit: 512.0M)
CGroup: /system.slice/system-magma.slice/magma@mme.service
└─7854 /usr/local/bin/mme -c /var/opt/magma/tmp/mme.conf -s /var/opt/magma/tmp/spgw.conf
- Check the status of the OVS module with the command below. Make sure the “is_connected” states are “true” and that there are no port errors.
    sudo ovs-vsctl show
  OVS in a good state looks similar to the following:
magma-dev:~$ sudo ovs-vsctl show
e2bf2cb0-7bbe-48ef-a489-3341731685e1
Manager "ptcp:6640"
Bridge "uplink_br0"
Port "uplink_br0"
Interface "uplink_br0"
type: internal
Port patch-agw
Interface patch-agw
type: patch
options: {peer=patch-up}
Port "dhcp0"
Interface "dhcp0"
type: internal
Bridge "gtp_br0"
Controller "tcp:127.0.0.1:6633"
is_connected: true
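The two OVS conditions above (every controller connected, no port errors) can be checked by scanning the `ovs-vsctl show` text. This is a rough sketch under stated assumptions: the function name is hypothetical, a controller is treated as disconnected whenever no `is_connected: true` line follows it, and port problems are assumed to appear as lines starting with `error:`. The `BAD` sample is invented for illustration.

```python
def ovs_problems(show_output):
    """Return (disconnected_controllers, error_lines) from `ovs-vsctl show` text."""
    controllers = {}  # controller target -> True once "is_connected: true" is seen
    errors = []
    current = None
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.startswith("Controller "):
            current = line.split(" ", 1)[1].strip('"')
            controllers[current] = False
        elif line == "is_connected: true" and current is not None:
            controllers[current] = True
        elif line.startswith("error:"):
            errors.append(line)
        elif line.startswith(("Bridge ", "Port ", "Interface ", "Manager ")):
            # A new record starts, so later lines no longer belong to the controller.
            current = None
    disconnected = [name for name, ok in controllers.items() if not ok]
    return disconnected, errors


# Tail of the good-state sample above.
GOOD = """\
Bridge "gtp_br0"
Controller "tcp:127.0.0.1:6633"
is_connected: true
"""

# Invented example: controller not connected and one port reporting an error.
BAD = """\
Bridge "gtp_br0"
Controller "tcp:127.0.0.1:6633"
Port "gtp0"
Interface "gtp0"
error: "could not open network device gtp0 (No such device)"
"""

print(ovs_problems(GOOD))
print(ovs_problems(BAD))
```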
For further debugging steps, follow the AGW Datapath debugging guide.
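The service checks earlier in this section (run state and memory use against the assigned limit) can also be read out of the status text automatically. The sketch below is an illustration, not a Magma tool: the function names and the 80% warning threshold are assumptions, and only the `Memory: <used> (limit: <limit>)` layout shown in the sample above is handled.

```python
import re

# Binary multipliers for the size suffixes in the status output (e.g. "113.9M").
_UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def to_bytes(size):
    """Convert a size such as '113.9M' to a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([BKMGT])", size)
    if match is None:
        raise ValueError(f"unrecognised size: {size!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def service_health(status_output, memory_threshold=0.8):
    """Summarise one unit's status text: run state and memory use against its limit."""
    active = re.search(r"^\s*Active:\s+(\S+) \(([^)]+)\)", status_output, re.M)
    memory = re.search(r"^\s*Memory:\s+(\S+) \(limit: ([^\s)]+)\)", status_output, re.M)
    running = bool(active) and active.groups() == ("active", "running")
    ratio = None
    if memory:
        ratio = to_bytes(memory.group(1)) / to_bytes(memory.group(2))
    return {
        "running": running,
        "memory_ratio": ratio,
        # None means no "Memory: ... (limit: ...)" line was found.
        "memory_ok": None if ratio is None else ratio < memory_threshold,
    }


# Shortened copy of the magma@mme status sample above.
STATUS = """\
magma@mme.service - Magma OAI MME service
Loaded: loaded (/etc/systemd/system/magma@mme.service; disabled; vendor preset: enabled)
Active: active (running) since Tue 2021-09-07 13:18:51 UTC; 3 min and 35s ago
Main PID: 7854 (mme)
Tasks: 28 (limit: 4915)
Memory: 113.9M (limit: 512.0M)
"""

print(service_health(STATUS))
```

For the sample, the memory ratio is 113.9 / 512.0, roughly 22% of the limit, so the service is well below the assumed threshold.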
Orchestrator Interface
- Verify connectivity with Orchestrator by running the command below:
    checkin_cli.py
  An AGW connection with Orc8r in a good state looks similar to the following:
magma@magma:~$ checkin_cli.py
1. -- Testing TCP connection to controller.magma.test.io:443 --
2. -- Testing Certificate --
3. -- Testing SSL --
4. -- Creating direct cloud checkin --
5. -- Creating proxy cloud checkin --
Success!
- Verify in the syslogs that the AGW is periodically checking in to Orc8r. Around every minute you should see the message below. Syslogs can be found in /var/log/syslog.
    Checkin Successful! Successfully sent state
- Verify in the NMS that the AGW checked in successfully (it is shown with "Good" health).
For further debugging steps, follow the AGW Unable to checkin to Orc8r guide.
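The "around every minute" expectation for check-in messages can be verified by measuring the gaps between them in the syslog. The following is a minimal sketch with several assumptions: the function names are hypothetical, each line is assumed to start with the traditional 15-character syslog timestamp (such as `Sep  7 13:18:51`, with English month names), the year must be supplied because that format omits it, and the host and process names in the sample lines are invented.

```python
from datetime import datetime

MARKER = "Checkin Successful! Successfully sent state"


def checkin_times(syslog_lines, year):
    """Return the timestamps of all successful check-in messages, in file order."""
    times = []
    for line in syslog_lines:
        if MARKER in line:
            stamp = line[:15]  # assumed layout: "Sep  7 13:18:51"
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def max_checkin_gap_seconds(syslog_lines, year):
    """Return the longest gap between consecutive check-ins, or None if fewer than two."""
    times = checkin_times(syslog_lines, year)
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return max(gaps) if gaps else None


# Invented lines for illustration; the third check-in arrives three minutes late.
LINES = [
    "Sep  7 13:18:51 magma example-service[1234]: Checkin Successful! Successfully sent state",
    "Sep  7 13:19:20 magma example-service[1234]: some unrelated message",
    "Sep  7 13:19:51 magma example-service[1234]: Checkin Successful! Successfully sent state",
    "Sep  7 13:22:51 magma example-service[1234]: Checkin Successful! Successfully sent state",
]

print(max_checkin_gap_seconds(LINES, year=2021))
```

A maximum gap well above 60 seconds suggests missed check-ins during that interval; the lines could be read from `/var/log/syslog` with `open(...).readlines()`.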
Subscribers
- Check the attached subscribers using the command below.
    sudo mobility_cli.py get_subscriber_table
  If users are unable to attach to the network, follow the User unable to attach guide.
- Check that subscriber packets are not being dropped by Magma. Follow the AGW Datapath debugging guide.
Performance
- Check CPU utilization with the command below. If it is high, check in the same output which process is using the most CPU; all processes are listed there.
    top
- Check memory utilization with the same command. You can also check memory per process, sorted numerically by %MEM, using the command below.
    ps -o pid,user,%mem,command ax | sort -b -k3,3 -n -r
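When the per-process memory listing needs to feed a script rather than a terminal, the unsorted `ps -o pid,user,%mem,command ax` output can be parsed and ranked numerically. This is a small sketch, not a Magma utility: the function name is hypothetical and the sample rows are invented, apart from the `mme` command line taken from the service example earlier.

```python
def top_memory(ps_output, count=5):
    """Return the `count` rows with the highest %MEM as (mem, pid, user, command)."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        # Skip the header row and anything that is not "PID USER %MEM COMMAND".
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        pid, user, mem, command = parts
        rows.append((float(mem), int(pid), user, command))
    # Sort on the numeric value so that 10.5 ranks above 9.0.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:count]


# Invented values for illustration.
SAMPLE = """\
  PID USER     %MEM COMMAND
    1 root      0.1 /sbin/init
 7854 root      9.0 /usr/local/bin/mme -c /var/opt/magma/tmp/mme.conf
 8001 root     10.5 /usr/bin/python3 example_service.py
"""

for mem, pid, user, command in top_memory(SAMPLE, count=2):
    print(f"{mem:5.1f}%  pid={pid}  user={user}  {command}")
```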
Metrics
Log in to the NMS UI. From the left side menu, select “Metrics” and review the available metrics. Look for any sudden spike or degradation that may indicate issues with the system.
- Number of Connected eNBs (Grafana -> Dashboards -> Networks)
- Number of Connected UEs (Grafana -> Dashboards -> Networks)
- Number of Registered UEs (Grafana -> Dashboards -> Networks)
- Attach/Reg attempts (Grafana -> Dashboards -> Networks)
- Attach Success Rate (Grafana -> Dashboards -> Networks)
- S6a Authentication Success Rate (Grafana -> Dashboards -> Networks)
- Service Request Success Rate (Grafana -> Dashboards -> Networks)
- Session Create Success Rate (Grafana -> Dashboards -> Networks)
- Upload/Download Throughput (Grafana -> Dashboards -> Gateway)
Note: The number of sites (eNodeBs) down, the number of users affected, and the outage duration are key indicators of service impact.
Optional Features
Make sure you test any other features that are applicable to your network:
- X2 Handover
- S1-Flex
- Inbound Roaming
- External DHCP
- UE Bridge Mode