
Campus Communication Strategies Transcript

Network Management: Tools and Practices

Ken Klingenstein
Director of Computing and Network Services
University of Colorado, Boulder
ken.klingenstein@colorado.edu

In this section of the seminar, we'll look a little more closely at some of the tools used for network management, and identify the best practices of leading institutions in managing their campus networks.

If you recall, in the first part of this seminar, we described network management as a layered set of services: a physical core, a set of network software that links that physical core together, and a set of enterprise-wide services that provide the glue on top of the network software and allow users to successfully navigate the networks. Cutting across those three layers is the need for a comprehensive organization and operational staff to put this all together.

At the physical core, there are a number of tools for fiber and cable management, tools to fuse fibers and splice them; indeed, fiber management has become a much more tractable task over the last several years. Network analyzers and SNMP probes provide a logical view into the physical network. We talked earlier about how our spare strategy has moved from keeping full routers on the shelf to having components of routers on the shelf. When a router fails, we may need to slide in new cards. There may be instances where we need to replace the backplane itself, or the power supply. In those instances, where we really have to insert a new router, we need to make sure the logic that drives that router, the stored configurations of firmware, is in place in the new router to replace the one that was broken. Lastly, the network management system should provide access into the inventory and database of components in your environment.

As we said in the first part of the seminar, SNMP is an extremely useful tool for managing our networks. There are only five commands in the first version of SNMP. A Get-Request command asks a network device to provide information. In turn, that network device returns the information with a Get-Response command. The other three commands include a Get-Next request, which allows a Get-Request to be followed by a short command to get the next entry in a table. The Set-Request is initiated by the network manager to set or reset values on the network device. And finally, the Trap command is issued by a network device when some abnormality or event occurs.
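
To make those five operations concrete, here is a toy sketch in Python of a manager/agent exchange. This is not a real SNMP implementation; the class and the MIB values are invented for illustration (the two OIDs shown are the standard sysName and ipDefaultTTL objects).

```python
# Toy sketch of the five SNMPv1 operations; not a real SNMP stack.
# The agent holds a tiny MIB of OID -> value pairs. String ordering
# stands in for true OID ordering, which is fine for this example.

class ToyAgent:
    """Stand-in for the SNMP agent inside a network device."""

    def __init__(self, mib):
        self.mib = dict(sorted(mib.items()))  # keep OIDs in order

    def get(self, oid):
        # Get-Request from the manager -> Get-Response from the agent.
        return ("Get-Response", oid, self.mib.get(oid))

    def get_next(self, oid):
        # Get-Next: return the first entry after `oid`, which lets a
        # manager walk a table entry by entry.
        for candidate in self.mib:
            if candidate > oid:
                return ("Get-Response", candidate, self.mib[candidate])
        return ("Get-Response", None, None)  # end of the MIB view

    def set(self, oid, value):
        # Set-Request: the manager sets or resets a value on the device.
        self.mib[oid] = value
        return ("Get-Response", oid, value)

    def trap(self, event):
        # Trap: the one operation the agent initiates on its own.
        return ("Trap", event)

agent = ToyAgent({"1.3.6.1.2.1.1.5.0": "gateway-1",  # sysName
                  "1.3.6.1.2.1.4.2.0": 64})          # ipDefaultTTL
print(agent.get("1.3.6.1.2.1.1.5.0"))
print(agent.get_next("1.3.6.1.2.1.1.5.0"))
print(agent.trap("linkDown on interface 2"))
```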

More precisely, it is the agents within these network devices that initiate and respond to commands. They can initiate commands through the Trap mechanism or they can respond to commands by being polled for data. In those instances where a network device does not speak native SNMP, there is a need to install a proxy agent that in turn will translate the SNMP request into the dialect of that network device. The same proxy agents can be used as a gateway between other network management protocols and SNMP. Lastly, a proxy agent may cache management data so that when a request comes in for that data, it can be responded to in a timely fashion.

SNMP asks agents to fetch values from their Management Information Bases, or MIBs. We've gone through two early generations of MIBs: MIB I and MIB II. Typically a MIB will store data about the interfaces connected to a network device, the Internet protocol, some of the control protocols for the Internet, and, as you can see, other protocols that may be operating on that network device.

If we look into a particular class of the MIB, we may find, for example, within the IP class, objects such as the default Time to Live setting for IP packets, a counter that registers the number of packets forwarded by that device, and a count of how many packets came from upper-level applications for that device to put out. We may also find counters that measure the number of packets that were dropped by a device due to resource limitations. We've talked about subnetting as a wonderful tool for network management; you can inquire of an object what its subnet mask is. There's also information about routing and other considerations.
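
For concreteness, here are a few of those IP-group objects as MIB-II defines them, with their standard object identifiers; the sample values are invented.

```python
# Objects from the MIB-II "ip" group (1.3.6.1.2.1.4), per RFC 1213.
# The OIDs and names are standard; the sample values are invented.
ip_group = {
    "1.3.6.1.2.1.4.2":  ("ipDefaultTTL",    64),      # default Time to Live
    "1.3.6.1.2.1.4.6":  ("ipForwDatagrams", 918273),  # packets forwarded by the device
    "1.3.6.1.2.1.4.10": ("ipOutRequests",   45231),   # packets from upper-level applications
    "1.3.6.1.2.1.4.8":  ("ipInDiscards",    17),      # dropped due to resource limitations
}
# Subnet masks live in the ipAddrTable: ipAdEntNetMask is
# 1.3.6.1.2.1.4.20.1.3, one entry per configured interface address.
for oid, (name, value) in ip_group.items():
    print(f"{oid:20} {name:16} {value}")
```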

When one queries a MIB, one gets back very basic information. As you can see, by issuing a command on a gateway box to ask what interfaces were up, this list of interfaces was presented. To move from this kind of information into a graphical display is essential so that an operator can quickly see what's up and what's down without having to navigate large datasets.
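
A small sketch of that translation step: mapping raw MIB-II ifOperStatus codes into an at-a-glance, color-coded view. The interface list is invented; the status codes (1 = up, 2 = down, 3 = testing) are from MIB-II.

```python
# Map MIB-II ifOperStatus codes (1 = up, 2 = down, 3 = testing) into
# an operator-friendly display. The interfaces below are invented.
STATUS = {1: "UP   [green]", 2: "DOWN [red]", 3: "TEST [yellow]"}

interfaces = {"Ethernet0": 1, "Ethernet1": 2, "Serial0": 1, "Serial1": 3}

for name, code in sorted(interfaces.items()):
    print(f"{name:10} {STATUS.get(code, 'UNKNOWN')}")
```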

Given the limitations of SNMP, a second version of SNMP has been proposed. Unfortunately, this version, which provides lots of useful characteristics, has been stuck in standardization processes for quite some time. But when it finally is released, it will provide the security that was missing in SNMP version 1. It will give us a command such as GetBulk, which will allow us to efficiently request large amounts of data rather than issuing a series of Get and Get-Next commands. It will allow a network device to have multiple agents that may respond to multiple network management stations. SNMP v2 also provides a hierarchical structure that allows a network device to have a chain of management stations above it. SNMP v2 also offers protocol analysis for more than the IP and TCP protocols; for example, it includes counters that measure IPX and AppleTalk traffic.
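
A toy illustration of why GetBulk matters: walking a 50-row table costs one round trip per row with Get-Next, versus a single exchange with GetBulk. The function and table names are invented; this models the exchange pattern only and speaks no actual SNMP.

```python
# Count the request/response round trips needed to walk a table.
# Invented names; this counts exchanges, it is not a protocol stack.
table = [f"ifEntry.{i}" for i in range(1, 51)]   # a 50-row toy table

def round_trips_get_next(rows):
    return len(rows)          # one Get-Next exchange per row

def round_trips_get_bulk(rows, max_repetitions=50):
    # GetBulk returns up to max_repetitions rows per exchange.
    full, partial = divmod(len(rows), max_repetitions)
    return full + (1 if partial else 0)

print(round_trips_get_next(table))   # 50
print(round_trips_get_bulk(table))   # 1
```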

Unfortunately, the move to switches in our network environment does not work well with SNMP. Switches hide traffic: they direct traffic between two adjacent networks without it becoming visible to SNMP. Because switches are optimized for huge amounts of volume, if we ask switches for the information that they do have, we may get overwhelmed. Switches build virtual networks; SNMP works off of physical networks. There may be a mismatch, and it may be difficult to relate the data that we're seeing to the logical environment we have set up with the switch. Lastly, switches are very busy. Asking them to perform SNMP operations in addition to switching large amounts of packets may tax the CPU of the switch too much.

Scanning the landscape of the leading institutions, we see a number of trends in managing the physical core. One trend is to convert shared Ethernets into switched Ethernets. This extends the capacity of the subnets a great deal. Secondly, leading universities are entering into explicit agreements with their users over the levels of service that are going to be provided. This is a key component of managing expectations. Thirdly, there is a new generation of network management systems being developed that are more integrated than the previous generations, that have hooks into the inventory databases, and that link well with trouble ticket systems. These network management systems are also open, allowing modules to be dropped in for new technologies as those technologies appear on the campus. There is now work being done to develop trouble tickets that can be exchanged between adjacent networks so that, as one does a diagnosis of a problem, one can pass the trouble ticket from the campus system to the external network provider. As indicated earlier, inventory and management need to be tied together.

Looking more deeply at the tools involved in managing the network layer, we want to look carefully at RMON and RMON 2 and at subnetting. One of the other major developments at this point is to create Domain Name Servers that answer differently for internal users than for external users. By doing this, we may direct an external user to a different Web server, for example, a Web server that is oriented towards providing admissions information and other information of more concern to the external community than to the internal community. Similarly, we can keep directories and other kinds of campus-based information internal by only placing that information on an internal Web server and using Domain Name Service to distinguish the two. Security practices are ongoing business at the network layer: the challenges to security are increasing each day, and so should the practices. Lastly, a very useful tool for managing Local Area Networks is to place performance monitors on the server. These kinds of performance monitors can not only watch network traffic on a LAN more closely, but can identify bottlenecks that are more a result of server capacity issues than of network capacity issues.
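
A sketch of that split-horizon idea (the campus network, names, and addresses are all invented): the same query gets a different answer depending on whether the client sits on the campus network.

```python
import ipaddress

# Split-horizon name service, sketched: internal clients are steered
# to the internal web server, everyone else to the admissions-facing
# one. The campus network and all addresses here are invented.
CAMPUS_NET = ipaddress.ip_network("192.0.2.0/24")

ZONES = {
    "internal": {"www.example.edu": "192.0.2.10"},    # directories, internal info
    "external": {"www.example.edu": "203.0.113.80"},  # external-facing server
}

def resolve(name, client_ip):
    internal = ipaddress.ip_address(client_ip) in CAMPUS_NET
    return ZONES["internal" if internal else "external"].get(name)

print(resolve("www.example.edu", "192.0.2.42"))    # on campus  -> 192.0.2.10
print(resolve("www.example.edu", "198.51.100.7"))  # off campus -> 203.0.113.80
```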

We see now that we have two tools that occupy similar niches: RMON and network analyzers. Network analyzers need to be dispatched; they're not resident on site. Network analyzers have no standard set of measurements: what is measured, and the definition of that measurement, is left to the vendor. Network analyzers may have advanced tools not available in the RMON environment, and they have network-specific components that can be added; for example, an FDDI monitor can be added to a general-purpose network analyzer. Network analyzers have high price tags and are usually deployed after some kind of network problem has become evident. RMON, on the other hand, has a more basic set of tools. Because those tools are more basic, the cost is lower. RMON probes can be permanently installed on a subnet and provide continuous monitoring. They are generally dependent on the technology that they're connected to, and RMON can provide historical data over some lengthy period of time, unlike network analyzers, which are ad hoc mechanisms.

RMON can provide not only host-specific data, but network-specific data. RMON devices can provide traffic analysis and statistics by network, by host, or by connections between two hosts. Like network analyzers, RMON devices can capture packets. They can have filters that help you identify only those packets that are relevant to the problem being diagnosed. The decoding of a captured packet into a higher-level presentation is done at the management station; that keeps the cost of the RMON probe itself very low. One can install RMON as these dedicated, low-cost probes, or one can install RMON into existing network devices such as hubs and routers. Because RMON is a busy monitoring tool, it can be processor-intensive. Installing RMON into a workstation which is also trying to do meaningful end-user work is ill-advised. Because RMON can capture large amounts of data, it can be disk-intensive as well.
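
In the spirit of those host and connection statistics, here is a sketch that tallies traffic per host and per host pair from an invented packet trace.

```python
from collections import Counter

# Sketch of RMON-style statistics: octets by source host and by
# (source, destination) conversation. The packet trace is invented.
packets = [
    ("10.0.1.5", "10.0.2.9", 1514),
    ("10.0.1.5", "10.0.2.9", 60),
    ("10.0.3.1", "10.0.1.5", 590),
]

by_host = Counter()
by_pair = Counter()
for src, dst, octets in packets:
    by_host[src] += octets
    by_pair[(src, dst)] += octets

print(by_host.most_common())   # traffic by host
print(by_pair.most_common())   # traffic by conversation
```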

Inside an RMON probe, you will find a number of databases being stored. You may find MIB II there, looking at the interfaces connected to a device, some system variables, and perhaps capturing some of the protocols being used. In addition, probes tend to have private MIBs developed by the vendor of the probe. These may help with configuration of the probe as well as provide some value-added or proprietary measurements. You can also find on a typical probe an RMON table that captures information about host transmissions, filters, packet captures, and so on, and you may find, in a token ring environment, a token ring monitoring database that will capture statistics relevant to the token ring technology.

The RMON MIB contains a number of important statistical areas. One is interface statistics: packets per interface, octets transmitted per interface, number of errors per interface. It will also look at some of the particular problems that occur in Ethernets, such as collisions, runts, and jabbers. A second RMON group is the history group. It maintains the statistics per interface over some period of time. This is configurable by the management station, so that we can ask for a rolling history over the last six hours, 24 hours, or several weeks, and retain exactly the level of information that we want. A third major MIB group within the RMON protocol is the alarm group. This is a set of thresholds which generate events and traps. We can generate those events for almost any variable that the device is capturing, and we can look at either an absolute threshold (that is, an event which is triggered by a certain value) or a delta value, triggering our alarm based upon a change in value. We can build alarms that trigger on the basis of matching a certain packet, or on the basis of a statistical counter reaching a certain value. Typically, what an alarm will do is trip an event.
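
A sketch of those two alarm styles against a rolling counter (the samples and threshold values are invented): an absolute alarm compares each sampled value to the threshold, while a delta alarm compares the change since the previous sample.

```python
# Sketch of RMON-style alarm sampling. "absolute" tests the sampled
# value itself; "delta" tests the change between successive samples.
# The counter samples and thresholds below are invented.
def check_alarm(samples, threshold, mode="absolute"):
    previous = samples[0]
    events = []
    for i, value in enumerate(samples[1:], start=1):
        tested = value if mode == "absolute" else value - previous
        if tested >= threshold:
            events.append((i, tested))   # this would trip an event/trap
        previous = value
    return events

octets = [1000, 1200, 5200, 5300, 9400]   # successive counter samples
print(check_alarm(octets, threshold=5000))                 # absolute: samples 2-4
print(check_alarm(octets, threshold=3000, mode="delta"))   # delta: bursts at 2 and 4
```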

In order to have the alarms work meaningfully, we want the data capture to be restricted to a particular set of packets. The filter group is used to match data packets against our predefined definition and initiate the trigger mechanism. Once we have triggered, we want to begin to capture packets. The RMON packet capture group allows us to capture those packets within a certain predefined buffer size, and to capture only certain packets once the trigger has hit. The event group drives our response to an alarm. It works with traps and permits us to turn events on and off and to log the events as they occur.
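
That filter-capture-event chain, sketched end to end with invented packets: a filter predicate selects the packets of interest, a fixed-size buffer captures the matches, and an event is logged when the trigger fires.

```python
from collections import deque

# Sketch of the RMON filter -> capture -> event chain, on invented
# data: match packets for one destination, capture them into a
# buffer of predefined size, and log an event when the trigger fires.
capture_buffer = deque(maxlen=3)   # predefined buffer size
event_log = []

def filter_match(packet):
    # Capture only packets bound for the host being diagnosed.
    return packet["dst"] == "10.0.2.9"

packets = [{"dst": "10.0.2.9", "len": 60},
           {"dst": "10.0.5.5", "len": 1514},
           {"dst": "10.0.2.9", "len": 590}]

for pkt in packets:
    if filter_match(pkt):
        if not capture_buffer:                     # first match trips the event
            event_log.append("filter matched; capture started")
        capture_buffer.append(pkt)                 # capture only matching packets

print(list(capture_buffer))
print(event_log)
```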

Turning our attention now to some of the better practices being used in institutions for managing the network layer, one clear value is in building redundancy into our key servers. One builds redundancy into Domain Name Servers, perhaps into the core Web server, and into some of the routing servers, on the basis of their importance in our environment. Secondly, it is very useful to provide common space for the public good, a place where we deposit site software, standard shareware that we want to promote on campus, and the like. In doing this, it helps the user to be able to access LAN software using LAN tools and network software using network tools: for example, using a "drag" metaphor to pull a file from a software server onto a client to address LAN operating system needs, while using FTP or a Web browser download to fetch plug-ins for the Web browsers themselves. A third useful trick in this arena is to concentrate on training and outreach. One of the best things we can do to make our networks manageable is to make our users manageable, and training is an important component of that. Another key aspect of this quality control is to control the variety of versions of a particular piece of software which may be out there, to make sure that we have only current versions available to users, and that those are installed consistently. Looking at routing today, many of us have migrated our routing protocols from RIP to OSPF, and many of us are now managing our external gateways through BGP-4. Lastly, we should all pay attention to the development of IPv6. IPv4 is the current Internet protocol, and while it has served us in good stead for over 25 years, it is finally showing its age. In particular, IPv4 does not have sufficient address space to accommodate future needs, and it does not have the hooks for quality of service that we may need. IPv6 is now in test-bed mode, having been recently completed as a standard. Universities will need to pay attention to the development of IPv6 and to develop migration strategies for their campus as IPv6 becomes more prominent.

Underlying client-server architecture is a mechanism called Remote Procedure Calls, or RPCs. Clients identify the procedure that they need through use of a unique ID on that procedure. At that point, a program that is running on a client will initiate a call across the network using some runtime systems to access the server. Client-server architectures provide a great deal of efficiency, help users to see a common interface, but place significant load upon a network and require some careful design.
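
As a concrete (and deliberately modern) illustration of the pattern, Python's standard library ships a simple RPC mechanism; in this sketch, the procedure name and port number are arbitrary choices.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# Minimal remote procedure call: a server registers a procedure
# under a name, and a client-side stub makes the network call look
# like a local one. Procedure name and port are arbitrary.
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(lambda a, b: a + b, "add")   # the ID clients use
threading.Thread(target=server.serve_forever, daemon=True).start()

# The runtime system marshals the arguments, crosses the network,
# runs the procedure on the server, and returns the result.
proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))   # executed on the server -> 5
server.shutdown()
```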

Climbing the protocol stack now to the enterprise-wide level, we want to look a little bit more at what determines a good network management system. We want to look briefly at the Distributed Computing Environment, a comprehensive approach to providing enterprise-wide security, file systems, and printing. As we mentioned earlier, directories, listservs, and universal email are also tools which are best managed at the enterprise level. At the enterprise level, it is also useful to develop firewalls and other security tools to preserve the sanctity of the campus network against outside intrusion. Site licenses are a very powerful way to induce users to use standard software. As we expand our offerings, we may confuse our users; user/customer focus groups are an excellent opportunity for users to come together and inform central network management of how network management is being perceived by the end user. Lastly, one important choice to make in shaping our toolset is between budget and authority. Free choice is antithetical to efficient support: we either need to confine the use of technology to standards or increase our budgets to deal with diversity.

Turning our attention to Network Management Systems, most Network Management Systems boast that they will do discovery processes. Unfortunately, discovery processes -- that is, identifying network resources automatically -- have real limitations. We may wind up with our display screen hopelessly congested. We may wind up with devices being "discovered" and displayed which we deliberately restricted from view. Given the complexity of modern Network Management Systems, it's important to pick one that can be installed easily, one that makes intelligent use of colors, and one that allows an operator sitting at a network management station to annotate a particular entry on the Network Management System as needed. We need scripting capabilities to make automatic and routine the functions that we need to perform repeatedly. We need to be able to customize the Network Management System for our environment, and hopefully have it deal with complexity. Lastly, since the Network Management System is responsible for the reliability of the network, we need the Network Management System itself to be reliable.
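
The crudest form of discovery, sketched as a ping sweep over an invented subnet; note that it happily "discovers" anything that answers, including devices you meant to keep off the map. This sketch uses Linux ping flags; real Network Management Systems layer ARP, SNMP, and routing-table probes on top of this.

```python
import ipaddress
import subprocess

# Ping-sweep discovery over an invented subnet, using Linux ping
# flags (-c count, -W timeout in seconds). Everything that answers
# gets "discovered," whether you wanted it displayed or not.
def discover(subnet):
    found = []
    for host in ipaddress.ip_network(subnet).hosts():
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", str(host)],
            capture_output=True)
        if result.returncode == 0:
            found.append(str(host))
    return found

print(discover("192.0.2.0/29"))   # small invented subnet
```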

DCE, the Distributed Computing Environment, is a relatively recent development: an enterprise-wide operating system. It provides multiplatform network computing and can scale from several machines to several thousand machines. It provides authentication. It gives you encryption at the MAC level of the network. It provides a file system which users sitting at their desktops can access transparently, but that file system may be resident inside the traditional glass house where backups and restores can be done effectively. To accommodate the political environment of a campus, DCE creates cells of autonomous domains, nests them together, and allocates to each cell a set of responsibilities and authorities that is consistent with the overall environment. Unfortunately, DCE is a complex piece of software, over 2,000,000 lines of code in its last release. It is still immature, needing additional enhancements. However, nothing else has emerged over the last several years to fill the niche that DCE is intended to fill.

As we said, security is best dealt with at the enterprise level. Another seminar in this virtual series will focus in great depth on security. For now, it is important to note the relationship between security and network management. Authentication proves that a client is who it says it is. It is important to note that authentication applies not only to users but to processes: processes need to log into the environment so that a user knows it is really in contact with the official server. Authorization takes authentication a bit further by saying "a proven user now has these permissions." Those permissions are generally contained in access control lists that match a set of attributes, such as READ and WRITE, to a client. Data integrity confirms that no modifications have been made to packets in transit across the network. Data privacy ensures that no one listened in on those packets and grabbed information such as passwords as they made their way through your network fabric.
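
A sketch of the access control list idea, with invented clients and resources: once authentication has proven who the client is, authorization checks the requested attribute against the list.

```python
# Authorization sketch: an access control list matches permission
# attributes, such as READ and WRITE, to a client. The resource,
# clients, and permissions here are invented.
ACL = {
    "grades-db": {"registrar": {"READ", "WRITE"},
                  "advisor":   {"READ"}},
}

def authorized(client, resource, permission):
    # `client` is assumed to already be authenticated.
    return permission in ACL.get(resource, {}).get(client, set())

print(authorized("advisor", "grades-db", "READ"))    # True
print(authorized("advisor", "grades-db", "WRITE"))   # False
```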

If we scan the landscape of higher ed for the best practices, it is clear that an important step is standard, comprehensive, widespread gathering of statistics. Those statistics serve many purposes. They help in planning for the future, in determining network problems, and in demonstrating the value of the network to the institution. They give you a longitudinal history that can be used to motivate additional funds. Users are an important part of the diagnostic process in modern networks, and tools that give users direct access to network management statistics aid this process. At the University of Colorado, we have implemented a service called Hyperhelp which, for example, allows a user with a single double-click to see traffic patterns on their subnet for the last hour, the last day, and the last week. That way, when a user says, "Gee, my network feels slow!" they can get data directly into their hands that helps them identify whether it's a local network problem or a campus-wide network problem. As we build structured, hierarchical support mechanisms, it is critical that the LAN managers be certified for competency. It is useful to offer courses to these LAN managers to ensure that they have the required skills. Typically, many of these courses are standardized courses that don't reflect local issues, so it's often good to augment these standardized packages with courses that say, for example, "On our campus, we assign IP addresses through the following tools." It also helps to have standard naming for key services across the institution: to have email naming, for example, that might be based upon your first and last name rather than on your login name; to have Web services that have a canonical hierarchy of directories, though, in fact, the information may be scattered across multiple servers; and to have print queues with standard names that help users figure out just exactly which printer they're addressing.
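
A sketch of the kind of rolling summary such a tool might hand the user (this is not the Hyperhelp implementation; the samples and window sizes are invented): byte totals on a subnet over the last hour, day, and week.

```python
import time

# Rolling-window traffic summaries for one subnet. Samples are
# (timestamp, octets) pairs; both the samples and the window logic
# are invented for illustration.
WINDOWS = {"last hour": 3600, "last day": 86400, "last week": 604800}

def summarize(samples, now=None):
    now = time.time() if now is None else now
    return {label: sum(octets for ts, octets in samples if now - ts <= span)
            for label, span in WINDOWS.items()}

now = time.time()
samples = [(now - 600, 5_000_000),      # ten minutes ago
           (now - 40_000, 9_000_000),   # about eleven hours ago
           (now - 300_000, 2_000_000)]  # about three and a half days ago
print(summarize(samples, now))
# {'last hour': 5000000, 'last day': 14000000, 'last week': 16000000}
```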

As we look at the crosscutting issues of organization and operations, it is important that, as our problems become more complicated, our procedures stay straightforward. That said, it is unlikely that the way we've been organized over the last several years will work as people become more reliant on information technology. So one of the tools that we need to consider is restructuring our organizations so that they can better respond to the complexity of our environment. We've talked a lot in this seminar about standards and creating standards. It's also important to decommission standards, to take them off the list as the technology ages. We can't continue to sustain a growing list of standards; we need to prune as well as grow. Lastly, it is very useful to create an overall architecture for information technology, to show staff and users how these pieces fit together, and to help justify the standards that we've created.

One of the best practices out there today for handling organizational change is to converge the physical layer staff; that is, to bring together all of the people who do wiring. Traditionally, many of us have kept our voice side and our data side separate, and our data staff has been responsible for design, installation, and maintenance alike. There are economies of scale to be realized by separating the design work from the installation and maintenance, and having installation and maintenance covered by staff whose primary interest and skill set focus at that level. It is also useful to layer support and standards; that is, the farther out one goes in network topology and in applications, the fewer the standards and the less the central support, so that as we get very far out there into the cutting-edge applications, or very far out there at the desktop, there is an increasing role for the end user in providing support. Lastly, it is important to support the LAN managers out there in the field. We do this by providing them with privileges that others don't have; for example, the ability to hand out large blocks of IP addresses themselves. We may do this by giving them tools that others don't have, such as utilities to manage disk space inside of institutional file systems. One way or the other, we need to make sure that the LAN managers become our colleagues in the complex business of managing our enterprise networks.

In closing, it has been noted that we are managing large networks with small tools. It has been said that data is not the plural of anecdote. We are finally stepping forward with the tools and data necessary to manage our networked environments. It is important for campus computing leadership to stay abreast of these new tools and new practices as they emerge.

Thank you.
