AI/ML FOUNDATIONS IN NETWORKING:

It has been a while! I’ve spent the past year or so diving deep into AI/ML techniques – learning about what’s actually happening under the hood with these different algorithms and determining what they can (and can’t) accomplish in the networking world. This personal learning journey culminated in an intensive twelve-week program with MIT on Applied Data Science that concluded with a capstone presentation in late December. My thanks to my family for putting up with many late nights and early mornings.

Career-wise I’m still planning on staying firmly within the realm of networking… but the way that we implement and maintain networks has changed dramatically just in the past several years and proper data science techniques are becoming critical. The intelligent implementation of the right algorithm – built with the right dataset – and running in the right place – can make a world of difference when it comes to operating and troubleshooting the network.

DATA IS THE BEDROCK OF AI

Put bluntly, you cannot have good AI without solid underlying data.

When it comes to applying ML techniques to networking, the size of the AP / switch install base alone does not determine success. The effectiveness of the algorithms depends on the quality of the data that the networking hardware feeds into the system.

There are two major data considerations when crafting and training ML models:

  • The number of samples captured (rows)
  • The features of the data that are captured with each event (columns)

Many systems require a large number of samples/rows for training before they can start to provide meaningful results. This is why the number of events is important. It’s not enough to only capture “bad” events; you need to see the good events in the network as well to build a baseline.
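To see why the good events matter, here is a minimal sketch with made-up counts (the AP names and numbers are hypothetical, not Mist data). Ranking by raw failure count points at one AP, while ranking by failure rate against the baseline of successful events points at the other.

```python
# Hypothetical per-AP event counts; names and numbers are illustrative only.
event_counts = {
    "ap-lobby": {"dhcp_success": 9950, "dhcp_failure": 50},
    "ap-warehouse": {"dhcp_success": 15, "dhcp_failure": 5},
}


def failure_rate(counts):
    """Fraction of DHCP attempts that failed, using good events as the baseline."""
    total = counts["dhcp_success"] + counts["dhcp_failure"]
    return counts["dhcp_failure"] / total if total else 0.0


# Ranking by raw failure count ignores how busy each AP is.
worst_by_count = max(event_counts, key=lambda ap: event_counts[ap]["dhcp_failure"])

# Ranking by rate is only possible when successful events are captured too.
worst_by_rate = max(event_counts, key=lambda ap: failure_rate(event_counts[ap]))

print("Most failures:", worst_by_count)
print("Highest failure rate:", worst_by_rate)
```

The busy AP has ten times as many failures, yet the quiet AP is the one failing a quarter of its attempts. Without the success counts there is no way to tell the two situations apart.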

In addition to the number of events captured, it’s also important that each event carries a rich set of data; i.e., it’s not enough to simply capture “DHCP Timeout” without additional details. It’s important to gather information on the location, the state of the network at the time, the VLAN associated with the issue, and so on. All of these additional features can be used to help determine correlations and commonalities in the network.
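As a rough illustration of what “commonalities” means here, the sketch below counts how often each feature value shows up across a handful of failure events. The events, field names, and values are all hypothetical, and this is a simple frequency count rather than anything a production platform would rely on by itself.

```python
from collections import Counter

# Hypothetical failure events; the field names mirror the kind of features
# described above (site, VLAN, AP), but every value here is made up.
failures = [
    {"event": "DHCP Timeout", "site": "hq", "vlan": 2, "ap": "ap-1"},
    {"event": "DHCP Timeout", "site": "hq", "vlan": 2, "ap": "ap-2"},
    {"event": "DHCP Timeout", "site": "branch", "vlan": 2, "ap": "ap-7"},
    {"event": "DHCP Timeout", "site": "branch", "vlan": 2, "ap": "ap-9"},
]


def most_common_value(events, feature):
    """Return (value, share of events) for the most frequent value of a feature."""
    counts = Counter(e[feature] for e in events)
    value, count = counts.most_common(1)[0]
    return value, count / len(events)


for feature in ("site", "vlan", "ap"):
    value, share = most_common_value(failures, feature)
    print(f"{feature}: {value!r} appears in {share:.0%} of failures")
```

In this toy data the failures are spread across sites and APs but all share one VLAN, which is the kind of lead you cannot get from a bare “DHCP Timeout” counter. A real analysis would also compare these shares against the same features in successful events before drawing a conclusion.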

To put this in networking terms, consider a physical link in your environment. Each connection has many features, including medium type, speed, upstream and downstream error rates, negotiation status, and more. What we are often most concerned with as network engineers, though, is the answer to the age-old question: “is this a problematic cable?” The classification label applied to the cable (“Good Cable / Bad Cable”) is the dependent variable that we want to proactively uncover by examining the combined values of the other attributes (also known as independent variables).
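Here is a minimal sketch of that split between independent variables and the dependent variable, using a handful of invented link records. It then fits a single-threshold classifier (a “decision stump”, the simplest possible decision tree) on one feature. The records, the field names, and the choice of `error_rate` as the splitting feature are my assumptions for illustration; this is not how any particular product classifies cables.

```python
# Hypothetical link records. Everything except "bad_cable" is an independent
# variable (feature); "bad_cable" is the dependent variable (label).
links = [
    {"medium": "copper", "speed_mbps": 1000, "error_rate": 0.0001, "bad_cable": False},
    {"medium": "copper", "speed_mbps": 1000, "error_rate": 0.0002, "bad_cable": False},
    {"medium": "fiber", "speed_mbps": 10000, "error_rate": 0.0005, "bad_cable": False},
    {"medium": "copper", "speed_mbps": 100, "error_rate": 0.02, "bad_cable": True},
    {"medium": "copper", "speed_mbps": 100, "error_rate": 0.05, "bad_cable": True},
]


def split_features_and_label(records, label):
    """Separate each record into its features (X) and its label (y)."""
    X = [{k: v for k, v in r.items() if k != label} for r in records]
    y = [r[label] for r in records]
    return X, y


def fit_threshold(X, y, feature):
    """Pick the threshold on one numeric feature with the fewest training errors.

    Predicts True (bad cable) when the feature value is above the threshold.
    Assumes the feature takes at least two distinct values.
    """
    values = sorted({row[feature] for row in X})
    # Candidate thresholds are midpoints between consecutive observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def training_errors(threshold):
        return sum((row[feature] > threshold) != truth for row, truth in zip(X, y))

    return min(candidates, key=training_errors)


def predict_bad_cable(row, feature, threshold):
    """Apply the fitted stump to one link."""
    return row[feature] > threshold


X, y = split_features_and_label(links, label="bad_cable")
threshold = fit_threshold(X, y, feature="error_rate")

new_link = {"medium": "copper", "speed_mbps": 1000, "error_rate": 0.03}
print("Learned error-rate threshold:", threshold)
print("New link flagged as bad cable:", predict_bad_cable(new_link, "error_rate", threshold))
```

A single threshold on one feature is deliberately naive. A full decision tree repeats this kind of split across many features, which is where the other attributes (speed, negotiation status, medium) start to earn their keep.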

Mist is extremely good at capturing both a high sample rate and a rich amount of features within each and every sample thanks to innovative systems like PACE, the ring buffer, and more. Here’s a sample of the different events that can be captured for a wireless client:

These events have a lot of correlation data included in each entry as well. For example, here is the programmatic output of a DHCP Timeout client event within Mist:

DHCP Timeout:
{
"dhcp_xid": 305958821,
"rssi": -68.0,
"capabilities": "80Mhz/40Mhz",
"bssid": "d420b0f156aa",
"num_streams": 1.0,
"proto": "ac",
"channel": 36.0,
"failure_count": 1.0,
"band": "5",
"ssid": "Mist_IoT",
"status_code": 1.0,
"wlan_id": "326f4ac8-73ec-479c-801e-cf545e06e04b",
"type": "MARVIS_EVENT_CLIENT_DHCP_FAILURE",
"ap": "d420b0f1054b",
"timestamp": 1702932016.256,
"org_id": [TRIMMED],
"pcap_url": [TRIMMED],
"site_id": [TRIMMED],
"random_mac": true,
"text": "Failing DHCP DISCOVER from 0a-dd-61-25-db-ef on vlan 2 with Xid 305958821",
"type_code": 1.0,
"mac": "0add6125dbef"
},

In addition to the key-value pairs above, there is also a dynamic PCAP associated with the failure event. Over 150 states are tracked for each client on the network, including many items that are not directly related to an “event” shown in the list above.
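To connect this back to rows and columns, here is a small sketch that turns a subset of the DHCP Timeout event above into a single feature row. The output column names (`rssi_dbm`, `band_ghz`, and so on) are my own, the VLAN is pulled from the free-text field with a regular expression because that is the only place it appears in this sample, and I am assuming the timestamp is in epoch seconds.

```python
import json
import re
from datetime import datetime, timezone

# A subset of the DHCP Timeout event shown above (trimmed fields omitted).
raw_event = """
{
  "rssi": -68.0,
  "channel": 36.0,
  "band": "5",
  "ssid": "Mist_IoT",
  "type": "MARVIS_EVENT_CLIENT_DHCP_FAILURE",
  "ap": "d420b0f1054b",
  "timestamp": 1702932016.256,
  "random_mac": true,
  "text": "Failing DHCP DISCOVER from 0a-dd-61-25-db-ef on vlan 2 with Xid 305958821",
  "mac": "0add6125dbef"
}
"""

VLAN_PATTERN = re.compile(r"\bvlan (\d+)\b")


def event_to_row(event):
    """Flatten one event dict into a feature row with consistent types."""
    vlan_match = VLAN_PATTERN.search(event.get("text", ""))
    return {
        "event_type": event["type"],
        "ap": event["ap"],
        "client_mac": event["mac"],
        "ssid": event["ssid"],
        "band_ghz": event["band"],
        "channel": int(event["channel"]),  # arrives as 36.0 in the sample
        "rssi_dbm": float(event["rssi"]),
        "random_mac": bool(event["random_mac"]),
        # The VLAN only appears inside the free-text field in this sample.
        "vlan": int(vlan_match.group(1)) if vlan_match else None,
        # Assumption: the timestamp is seconds since the Unix epoch.
        "time_utc": datetime.fromtimestamp(event["timestamp"], tz=timezone.utc),
    }


row = event_to_row(json.loads(raw_event))
for name, value in row.items():
    print(f"{name}: {value}")
```

Each event becomes one row, and each key becomes a column. Stack enough of these rows, good events included, and you have a dataset that a model can actually be trained on.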

Having this quality of data available to the cloud platform is what really makes the different algorithms effective – and the importance of this foundational data should not be understated. Once the foundational data is in place a properly designed ML platform can start to uncover correlations and provide root cause analysis across the network.

There are many algorithms that keep everything running smoothly in the Mist platform – from LSTM to Decision Trees:

A PURPOSE-BUILT PLATFORM IS CRITICAL:

Having the data in place is an important first step. The next is making that data easy to access.

Mist leverages a microservices architecture with modern data storage and retrieval techniques. As an example, using Marvis Query Language I can execute an event query within an organization that returns 129M results … in under five seconds. This level of responsiveness isn’t possible without a purpose-built architecture.

In addition, every action and insight available within the GUI is also available within the API.

To get started with mapping API calls and pulling data programmatically within Mist, I highly recommend leveraging these resources:

Juniper also offers self-paced training directly in the dashboard. You can learn how to get started with Python and other tools in the “Mist AI for IT” course in our dashboard:

In the next post we’ll review the implementation of a sample ML algorithm to uncover trends in wireless coverage. Future posts will dive more deeply into specific technologies like Marvis and Shapley.
