Version: 1.8.0

Metrics

Overview

Magma gateways and Orc8r generate a large number of metrics, which provide a great deal of visibility into gateways, base stations, subscribers, reliability, throughput, and more. These metrics are regularly pushed to Prometheus, which, together with Grafana, lets us store and query them. By default, all metrics are retained in Prometheus for 30 days. For unlimited retention and a more scalable metrics pipeline, Magma can also be deployed with Thanos.

Metrics Explorer

Metrics Explorer provides an easy way to learn about and explore the metrics available in the system. It can be viewed through the NMS. You can search and filter metrics by name or description. Additionally, clicking the detailed view on a metric lets you explore its current trends via the Grafana explorer.

Grafana

Grafana provides a powerful, configurable, and user-friendly dashboarding solution. Any user within an organization can create and edit custom timeseries dashboards that will be visible to all other users in their organization. An important detail is that Grafana access is limited to users in an organization with the "Super-User" title (you select this when provisioning users in an organization). This is a technical workaround: Grafana allows all users to query across any network that the organization owns, so restricting access ensures that users with additional network visibility restrictions within an organization can't see information from networks they are restricted from.

When you click on the Grafana tab, we have to do some book-keeping on the backend, so the initial load may take a few seconds. You'll then see the built-in dashboards available to you. These dashboards contain dropdown selectors to choose which network(s) and gateway(s) you want to look at. In Grafana you can look at any collection of networks or gateways your organization has access to at once; simply select or deselect the networks/gateways you want to see and the graphs will update. In the top-right corner there is an option to choose the time range the graphs display. The default is 6 hours.

Custom Dashboards

With Grafana, you can create your own custom dashboards and populate them with any graphs and queries you want. These custom dashboards will be visible to all other users in the organization you belong to in the NMS. The simplest way is to click the "+" icon on the left sidebar, then create a new dashboard. There is ample documentation about Grafana dashboards online if you need help creating yours.


If you want to replicate the networkID or gatewayID variables found in the preconfigured dashboards, we provide a "template" dashboard to make that easy. Simply open the Template dashboard and click the gear icon near the top right. From there, click "Save As" and enter the name you want. Your new dashboard will now have the gatewayID and networkID variables. One technical detail: you need to use the =~ operator when matching label names against these variables in order to see more than one network or gateway at a time. This is because the =~ operator tells Prometheus to match the value as a regex.
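For instance, a panel query using both variables could look like the following (the metric name ue_connected is just an illustrative choice; any metric carrying the networkID and gatewayID labels works the same way):

```
ue_connected{networkID=~"$networkID", gatewayID=~"$gatewayID"}
```

Because both matchers use =~, selecting multiple networks or gateways in the dropdowns expands each variable into a regex alternation and the query returns one series per selected gateway.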

Enabling Access

The feature flag is enabled by default for all new organizations created in the NMS. If you want to turn this feature off or on, you can do so from the host organization. Log in to the host organization, navigate to the feature flag page using the left sidebar, then edit the feature flag named "Include tab for Grafana in the Metrics page". Support can be turned on and off for individual organizations.

List of currently available metrics

An up-to-date view is available through the Metrics Explorer.

| Metric Identifier | Description | Category | Service |
|---|---|---|---|
| s1_connection | eNodeB S1 connection status | eNodeB | MME |
| user_plane_bytes_ul | User plane uplink bytes | eNodeB | |
| user_plane_bytes_dl | User plane downlink bytes | eNodeB | |
| enodeb_mgmt_connected | eNodeB management plane connected | eNodeB | enodebd |
| enodeb_mgmt_configured | eNodeB is in configured state | eNodeB | enodebd |
| enodeb_rf_tx_enabled | eNodeB RF transmitter enabled | eNodeB | enodebd |
| enodeb_rf_tx_desired | eNodeB RF transmitter desired state | eNodeB | enodebd |
| enodeb_gps_connected | eNodeB GPS synchronized | eNodeB | enodebd |
| enodeb_ptp_connected | eNodeB PTP/1588 synchronized | eNodeB | enodebd |
| enodeb_opstate_enabled | eNodeB operationally enabled | eNodeB | enodebd |
| enodeb_reboot_timer_active | Is timer for eNodeB reboot active | eNodeB | enodebd |
| enodeb_reboots | eNodeB reboots counter | eNodeB | enodebd |
| rrc_estab_attempts | RRC establishment attempts | eNodeB | enodebd |
| rrc_estab_successes | RRC establishment successes | eNodeB | enodebd |
| rrc_reestab_attempts | RRC re-establishment attempts | eNodeB | enodebd |
| rrc_reestab_attempts_reconf_fail | RRC re-establishment attempts due to reconfiguration failure | eNodeB | enodebd |
| rrc_reestab_attempts_ho_fail | RRC re-establishment attempts due to handover failure | eNodeB | enodebd |
| rrc_reestab_attempts_other | RRC re-establishment attempts due to other cause | eNodeB | enodebd |
| rrc_reestab_successes | RRC re-establishment successes | eNodeB | enodebd |
| erab_estab_attempts | ERAB establishment attempts | eNodeB | enodebd |
| erab_estab_failures | ERAB establishment failures | eNodeB | enodebd |
| erab_estab_successes | ERAB establishment successes | eNodeB | enodebd |
| erab_release_requests | ERAB release requests | eNodeB | enodebd |
| erab_release_requests_user_inactivity | ERAB release requests due to user inactivity | eNodeB | enodebd |
| erab_release_requests_normal | ERAB release requests due to normal cause | eNodeB | enodebd |
| erab_release_requests_radio_resources_not_available | ERAB release requests due to radio resources not available | eNodeB | enodebd |
| erab_release_requests_reduce_load | ERAB release requests due to reducing load in serving cell | eNodeB | enodebd |
| erab_release_requests_fail_in_radio_proc | ERAB release requests due to failure in the radio interface procedure | eNodeB | enodebd |
| erab_release_requests_eutran_reas | ERAB release requests due to EUTRAN generated reasons | eNodeB | enodebd |
| erab_release_requests_radio_conn_lost | ERAB release requests due to radio connection with UE lost | eNodeB | enodebd |
| erab_release_requests_oam_intervention | ERAB release requests due to OAM intervention | eNodeB | enodebd |
| pdcp_user_plane_bytes_ul | User plane uplink bytes at PDCP | eNodeB | enodebd |
| pdcp_user_plane_bytes_dl | User plane downlink bytes at PDCP | eNodeB | enodebd |
| ip_address_allocated | Total IP addresses allocated | AGW | mobilityd |
| ip_address_released | Total IP addresses released | AGW | mobilityd |
| s6a_auth_success | Total successful S6a auth requests | AGW | subscriberdb |
| s6a_auth_failure | Total failed S6a auth requests | AGW | subscriberdb |
| s6a_location_update | Total S6a location update requests | AGW | subscriberdb |
| diameter_capabilities_exchange | Total Diameter capabilities exchange requests | AGW | subscriberdb |
| diameter_watchdog | Total Diameter watchdog requests | AGW | subscriberdb |
| diameter_disconnect | Total Diameter disconnect requests | AGW | subscriberdb |
| dp_send_msg_error | Total datapath message send errors | AGW | pipelined |
| arp_default_gw_mac_error | Total errors with default gateway MAC resolution | AGW | pipelined |
| openflow_error_msg | Total OpenFlow error messages received by code and type | AGW | pipelined |
| unknown_pkt_direction | Counts number of times a packet is missing its flow direction | AGW | pipelined |
| enforcement_rule_install_fail | Counts number of times rule install failed in enforcement app | AGW | pipelined |
| enforcement_stats_rule_install_fail | Counts number of times rule install failed in enforcement stats app | AGW | pipelined |
| network_iface_status | Status of a network interface required for data pipeline | AGW | pipelined |
| subscriber_icmp_latency_ms | Reported latency for subscriber in milliseconds | AGW | monitord |
| magmad_ping_rtt_ms | Gateway ping metrics in milliseconds | AGW | magmad |
| cpu_percent | System-wide CPU utilization as percentage over 1 second | AGW | magmad |
| swap_memory_percent | Percent of memory that can be assigned to processes | AGW | magmad |
| virtual_memory_percent | Percent of memory that can be assigned to processes without the system going to swap | AGW | magmad |
| mem_total | Total memory | AGW | magmad |
| mem_available | Available memory | AGW | magmad |
| mem_used | Used memory | AGW | magmad |
| mem_free | Free memory | AGW | magmad |
| disk_percent | Percent of disk space used for the volume mounted at root | AGW | magmad |
| bytes_sent | System-wide network I/O bytes sent | AGW | magmad |
| bytes_received | System-wide network I/O bytes received | AGW | magmad |
| temperature | Temperature readings from system sensors | AGW | magmad |
| checkin_status | 1 for checkin success, and 0 for failure | AGW | magmad |
| bootstrap_exception | Count for exceptions raised by bootstrapper | AGW | magmad |
| unexpected_service_restarts | Count of unexpected service restarts | AGW | magmad |
| unattended_upgrade_status | Unattended Upgrade status | AGW | magmad |
| service_restart_status | Count of service restarts | AGW | magmad |
| enb_connected | Number of eNodeB connected to MME | AGW | MME |
| ue_registered | Number of UE registered successfully | AGW | MME |
| ue_connected | Number of UE connected | AGW | MME |
| ue_attach | Number of UE attach success | AGW | MME |
| ue_detach | Number of UE detach | AGW | MME |
| s1_setup | Counter for S1 setup success | AGW | MME |
| mme_sgs_eps_detach_indication_sent | SGS EPS detach indication sent | AGW | MME |
| sgs_eps_detach_timer_expired | SGS EPS detach timer expired | AGW | MME |
| sgs_eps_implicit_detach_timer_expired | SGS EPS implicit detach timer expired | AGW | MME |
| mme_sgs_imsi_detach_indication_sent | SGS IMSI detach indication sent | AGW | MME |
| sgs_imsi_detach_timer_expired | SGS IMSI detach timer expired | AGW | MME |
| sgs_imsi_implicit_detach_timer_expired | SGS IMSI implicit detach timer expired | AGW | MME |
| mme_spgw_delete_session_rsp | SPGW delete session response | AGW | MME |
| initial_context_setup_failure_received | Initial context setup failure received | AGW | MME |
| nas_service_reject | NAS service reject | AGW | MME |
| mme_s6a_update_location_ans | S6a update location answer | AGW | MME |
| sgsap_paging_reject | SGS paging reject | AGW | MME |
| duplicate_attach_request | Duplicate attach request | AGW | MME |
| authentication_failure | Auth failure | AGW | MME |
| nas_auth_rsp_timer_expired | NAS auth response timer expired | AGW | MME |
| emm_status_rcvd | EMM status received | AGW | MME |
| emm_status_sent | EMM status sent | AGW | MME |
| nas_security_mode_command_timer_expired | NAS security mode command timer expired | AGW | MME |
| extended_service_request | Extended service request | AGW | MME |
| tracking_area_update_req | Tracking area update request | AGW | MME |
| tracking_area_update | Tracking area update success | AGW | MME |
| service_request | Service request | AGW | MME |
| security_mode_reject_received | Security mode reject received | AGW | MME |
| mme_new_association | New SCTP association | AGW | MME |
| ue_context_release_command_timer_expired | UE context release command timer expired | AGW | MME |
| enb_sctp_shutdown_ue_clean_up_timer_expired | SCTP shutdown UE clean-up timer expired | AGW | MME |
| ue_context_release_req | UE context release request | AGW | MME |
| s1ap_error_ind_rcvd | S1AP error indication received | AGW | MME |
| s1_reset_from_enb | S1 reset from eNB | AGW | MME |
| nas_non_delivery_indication_received | NAS non-delivery indication received | AGW | MME |
| spgw_create_session | SPGW create session success | AGW | MME |
| ue_pdn_connection | UE PDN connection | AGW | MME |
| ue_pdn_connectivity_req | UE PDN connectivity request | AGW | MME |
| ue_reported_usage / up | Reported TX traffic for subscriber / session in bytes | AGW | sessiond |
| ue_reported_usage / down | Reported RX traffic for subscriber / session in bytes | AGW | sessiond |
| ue_dropped_usage / up | Reported dropped TX traffic for subscriber / session in bytes | AGW | sessiond |
| ue_dropped_usage / down | Reported dropped RX traffic for subscriber / session in bytes | AGW | sessiond |

REST API for querying metrics

Metrics can also be queried programmatically through the Orc8r REST API; the relevant endpoints are listed in the Swagger UI.
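As a sketch, a query might look like the following. The hostname, certificate paths, network ID, and metric name here are all illustrative assumptions; substitute the values for your own deployment, and check the Swagger UI for the exact query endpoints exposed by your Orc8r version.

```shell
# Query the current value of an example metric (ue_connected) for one
# network via the Orc8r REST API. The admin operator client certificate
# is required to authenticate; paths and hostname are placeholders.
curl --cert admin_operator.pem --key admin_operator.key.pem \
  "https://api.yourdomain.com/magma/v1/networks/mynetwork/prometheus/query?query=ue_connected"
```

The response follows the standard Prometheus HTTP API format, so the same query strings you use in Grafana panels work here as well.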

Troubleshooting Metrics

On the gateways, the magmad service collects metrics from all the services and pushes them to Orc8r. In Orc8r, metricsd receives the metrics and pushes them to registered metric exporters, of which Prometheus is one of the main ones. Specifically, Orc8r pushes the metrics to an edge hub (prometheus-cache), which is later scraped by the Prometheus instance.

On the query side, when we make queries through the NMS or Swagger, Orc8r queries the Prometheus instance directly.

We can effectively troubleshoot metrics issues by looking at the logs of all the components involved. On the gateways, syslog may contain error logs if a failure occurs during metric upload:

ERROR:root:Metrics upload error! [StatusCode.UNKNOWN] rpc error: code = Unavailable desc = client_loop: send disconnect: Broken pipe
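To dig further on the gateway itself, you can follow the magmad service logs and verify connectivity to Orc8r, which metric upload depends on. The grep filter below is just an illustrative pattern, not an exact log format:

```shell
# Follow magmad logs on the gateway and filter for metrics activity.
sudo journalctl -u magma@magmad -f | grep -i metric

# Verify gateway-to-Orc8r connectivity; upload errors like the one
# above are often a symptom of a broken checkin.
sudo checkin_cli.py
```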

On the Orc8r side, we can debug issues by dumping the logs of prometheus, prometheus-configmanager, prometheus-cache, and metricsd:

kubectl --namespace orc8r logs -l app.kubernetes.io/component=prometheus-configurer -c prometheus-configurer
kubectl --namespace orc8r logs -l app.kubernetes.io/component=prometheus -c prometheus
kubectl --namespace orc8r logs -l app.kubernetes.io/component=metricsd
helm --debug -n orc8r get values orc8r
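If the logs are inconclusive, you can also query the Prometheus instance directly over its HTTP API by port-forwarding it locally. The service name orc8r-prometheus below is an assumption; run kubectl get svc -n orc8r to find the actual name in your deployment:

```shell
# Forward the Prometheus API to localhost (service name may differ).
kubectl --namespace orc8r port-forward svc/orc8r-prometheus 9090:9090 &

# Ask Prometheus directly whether it is ingesting gateway metrics;
# an empty result here points at the push/scrape path rather than
# the NMS or Grafana query path.
curl 'http://localhost:9090/api/v1/query?query=checkin_status'
```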