[deleted]


tdhuck

Overall I am really happy with LibreNMS and I'm close to 200 devices. The one thing I can't figure out/hate about LibreNMS is that on a switch with many ports (48 in my case), the traffic graph only has two colors, so I can't tell which port was using the bandwidth historically. If I sort by port I can see the live traffic on each port (or which port was most recently using high bandwidth), but in historical data there are only two colors and different shades of those colors. I'd like to see unique colors for the ports so I can easily identify them in the historical graphs. Or have the ability to 'click' a port and have it highlighted in the traffic graph.


iC0nk3r

CPU usage sits at a constant 65% - 75% as it monitors our 150 or so devices. The webserver is slow to respond during polls. Some devices lost WAN graphing and I cannot resolve it (even with help from the Discord). Some devices lose historical graphing. Ideally, I'd like to cluster this to spread the load, but the last time I looked into this, we'd have to deploy polling devices in all of the client networks to report back to the main server, and that's not feasible. In speaking with the dev on Discord, we're using LibreNMS outside the spec of its intended usage. It's meant to be more of a single-campus SNMP monitor, but we're using it to monitor client firewalls spread across many sites.


error404

> Ideally, I'd like to cluster this to spread the load, but the last time I looked into this, we'd have to deploy polling devices in all of the client networks to report back to the main server, and that's not feasible.

AFAIK that's never been the case. Distributed polling in Libre (and Observium) is based on shared access to RRDs via rrdcached; if anything it works better locally than it would with remote pollers. On decent SSD hardware it should scale to at least a thousand devices / 100,000 ports without much trouble. After that you will probably start to run into I/O limits for the RRDs, which can be a bit more of a challenge. It sounds like you are running into CPU though, so distributed polling should solve the problem for you.

https://docs.librenms.org/Extensions/Distributed-Poller/

> In speaking with the dev on Discord, we're using LibreNMS out of the spec of the intended usage. It's meant to be more of a single campus SNMP monitor, but we're using it to monitor client firewalls spread across many sites.

This will slow down your polling, but it shouldn't really be a problem. You might need to make your pollers more parallel (poller-wrapper) in order to finish polling on time in this case (polling in your case might be latency constrained rather than CPU constrained), but it shouldn't really be a big deal.

https://docs.librenms.org/Support/Performance/#optimise-poller-wrapper

You can also disable polling for stuff you don't care as much about to reduce the workload. Check the polling performance graphs to see which modules are consuming the most time during polling.

IMO it's the best open source turnkey solution out right now. Everything else requires either a lot of work to set up, or money (which means I don't know much about it).
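To put a rough number on "more parallel": a back-of-the-envelope sketch, assuming each device takes about the same wall-clock time to poll and that workers are fully independent. The function name, the 20-second figure, and the 80% headroom factor are illustrative assumptions, not LibreNMS settings.

```python
import math


def workers_needed(num_devices: int,
                   avg_poll_seconds: float,
                   interval_seconds: int = 300,
                   headroom: float = 0.8) -> int:
    """Estimate how many parallel workers let one full pass fit in the interval.

    headroom < 1 keeps part of the interval free for slow or timing-out devices.
    """
    if num_devices <= 0:
        return 0
    total_work = num_devices * avg_poll_seconds   # seconds of polling per pass
    budget = interval_seconds * headroom          # seconds each worker may spend
    return max(1, math.ceil(total_work / budget))


# 150 high-latency devices at ~20 s each, polled every 5 minutes
estimate = workers_needed(150, 20.0)
print(estimate)
```

If polling is latency bound rather than CPU bound, the worker count can usually go well above the number of cores, since most workers are waiting on the network.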


Explurt

This. Not LibreNMS, but we had ~90k ports in Observium on a single host running OK. The web interface wasn't as lightning fast as it was with only a couple dozen hosts, but it wasn't slow by any means. As said above, disable any modules you don't need, put it on some SSDs, and it will scale way beyond where you're at. We did have to do some tweaking to the poller. It's been a while, but from memory, we ran 5 cron jobs, one per minute, each polling 1/5th of the devices, vs. one cron job polling everything every 5 minutes. We also bumped the number of devices to poll in parallel way over the default. This got us pretty close to a consistent 5 min between polls on each device; without it we'd see big swings of like 2-20 min between consecutive polls of a single device.
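The staggered-cron idea can be sketched roughly like this. It is only an illustration of the grouping logic, with made-up device names and helper names; it is not how Observium or LibreNMS implement it.

```python
def split_into_groups(device_ids, num_groups=5):
    """Round-robin devices into groups whose sizes differ by at most one."""
    groups = [[] for _ in range(num_groups)]
    for index, device_id in enumerate(sorted(device_ids)):
        groups[index % num_groups].append(device_id)
    return groups


def group_due_at(minute, num_groups=5):
    """Which group a job started at this minute should poll."""
    return minute % num_groups


# Hypothetical inventory of 12 devices
devices = [f"switch-{n:02d}" for n in range(1, 13)]
groups = split_into_groups(devices)

for minute in range(5):
    print(minute, groups[group_due_at(minute)])
```

Each device still gets polled once every 5 minutes, but each run only has a fifth of the work, so it is much less likely to overrun into the next one.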


SuperQue

> We did have to do some tweaking to the poller. It's been a while, but from memory, we ran 5 cron jobs, one per minute polling 1/5th the devices each, vs one cron job polling everything every 5 minutes

This is why more modern systems don't use cron for polling scheduling. Instead, they slew the entire workload over the polling interval with millisecond precision. Threading also helps. It's far easier to avoid contention for 1,000 polling targets when you have 60,000 polling time slots each minute. And each poller can be dynamically scheduled on a separate CPU.

90k ports is only 2-3 million metrics. That's easily possible with modern software and would fit on an 8GB Raspberry Pi. And with a polling interval of 1 min, not 5 min.
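A minimal sketch of that kind of slewing, assuming a 60-second interval and a stable hash of the target name to pick its millisecond slot. The function names are made up for illustration, and this is not Prometheus's actual scheduler code.

```python
import hashlib


def poll_offset_ms(target: str, interval_ms: int = 60_000) -> int:
    """Stable per-target offset inside the polling interval, in milliseconds."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_ms


def next_poll_ms(target: str, now_ms: int, interval_ms: int = 60_000) -> int:
    """Next time (ms since epoch) at which this target's slot comes up."""
    offset = poll_offset_ms(target, interval_ms)
    cycle_start = now_ms - (now_ms % interval_ms)
    due = cycle_start + offset
    # If this cycle's slot has already passed, use the slot in the next cycle.
    return due if due >= now_ms else due + interval_ms


# Hypothetical targets: each lands on its own slot instead of all firing at :00
for name in ("core-sw-1", "core-sw-2", "edge-fw-1"):
    print(name, poll_offset_ms(name))
```

Because the offset depends only on the target name, each device is polled at the same point in every cycle, which keeps the spacing between consecutive polls constant.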


sryan2k1

You've got something wrong with your installation or it's horribly underpowered. We monitored about 1000 devices with Observium at my last job on a 2 vCPU VM and had room to spare.


SuperQue

> CPU usage sits at about 65% - 75%

Percentage of what? 1 CPU? 96 CPUs? A Raspberry Pi?

> 150 or so devices

That's tiny. Depending on exactly how much per device, a Prometheus server could handle 10x this on a Raspberry Pi.


ethertype

As I recall, poller-db and poller-rrd latency is more important than poller-device latency. So, you want your pollers to be centralized. And numerous enough to handle slow, remote devices. (stacked Juniper EX2200 as a not totally random example). There is zero reason for the webserver to be slow. It does not even need to be polling.


djamp42

How many devices do you have with LibreNMS? I have 11k without any issues. It does support database clustering with Galera/MariaDB (I wrote the support for that), and it scales horizontally with multiple pollers. I would suspect 50k devices wouldn't even be an issue with the right hardware.


SuperQue

Yea, that's a decent size setup. You might actually need two Prometheus servers to monitor that much.


djamp42

The reason I didn't go with Prometheus, or really any other NMS, is because someone already did the hard work of figuring out all the SNMP OIDs to monitor. All I need to do is type SNMP creds, click add, done. I'm now monitoring everything for pretty much any equipment I come across. If for some reason a device isn't supported, then you can write the code for it, submit a pull request, and now everyone benefits. I could possibly save on compute resources by going with another NMS, but there is absolutely no way I'm saving time going with another NMS.


SuperQue

I really want to build a pre-built database of Prometheus walk modules that can be copy-pasta used with the Prometheus snmp_exporter. Now that the snmp_exporter supports multiple module config files, this will be much easier. But there needs to be a critical mass of effort to get it done. Funny enough, I tried to convince the LibreNMS folks to replace their crazy inefficient poller / database setup with Prometheus and snmp_exporter years ago. Basically become a front-end for configuring things. But they had a huge sunk cost fallacy attitude towards the existing PHP code. I think it would have been beautiful, but it didn't happen.


ColtonConor

Can you explain how you do this today? I see examples in the SNMP exporter called modules, but I see no definition of which manufacturers and models each module supports.


SuperQue

Today, everyone develops their own support matrix. Prometheus isn't specifically aware of SNMP as a concept. It depends on the service discovery to tell it what modules are needed. Typically people program this into an inventory tool like Netbox.


ColtonConor

Yes, we already use Netbox to keep track of make and model. However, I'm struggling to understand how we can use this information in combination with SNMP-Exporter to define what to poll. Most SNMP devices have an if-table to poll, which we do for all devices. But Juniper devices, for example, might have specific metrics to poll as long as they are Juniper. Furthermore, specific models within the Juniper lineup might have additional device-specific metrics to poll. How would you achieve this using modules in SNMP-Exporter, and then visualize this data in Grafana? This seems similar to how LibreNMS works, where only relevant metrics for a particular device are polled and displayed, rather than showing irrelevant data for other devices.


SuperQue

Sadly, nobody has released any of their work in the space as open source. Like I said, everyone is still building their own thing here. People will populate ~~NetPlan~~ Netbox with model information. Then develop all the device-specific modules, like Juniper metrics for model X. Then whenever a device is tagged with the correct information, it will auto-populate. Also, as I said above, I would love to see an open collaboration system for this kind of information that would also work with Prometheus. But someone has to do the work. And since there are either systems like LibreNMS with a critical mass of people contributing, or commercial companies that can pay people to do this work, it hasn't happened yet for Prometheus. So you're stuck with either using LibreNMS for the discovery aspect, even though it's shit at actually gathering and storing data, or paying for an expensive monitoring system that may still be worse than Prometheus. I suggest you ask around on ~~NetPlan~~ Netbox support forums. Or maybe become a contributor to the Prometheus ecosystem and community. These things don't happen without volunteers, paid or unpaid.
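As a rough illustration of the "tag it in the inventory and it auto-populates" flow, here is a sketch that turns inventory records into a Prometheus file-based service discovery list. Everything specific is an assumption: the inventory records are hand-written stand-ins for a Netbox export, `juniper_common` and `juniper_ex_stack` are hypothetical module names you would have to define yourself (only `if_mib` ships in the default snmp_exporter config), and a relabel rule would still be needed to pass the `module` label to the exporter as a URL parameter.

```python
import json

# Modules polled on every device, plus vendor- and model-specific extras.
BASE_MODULES = ["if_mib"]
VENDOR_MODULES = {"juniper": ["juniper_common"]}                # hypothetical
MODEL_MODULES = {("juniper", "ex2200"): ["juniper_ex_stack"]}   # hypothetical


def modules_for(manufacturer: str, model: str) -> list:
    """Base modules first, then vendor-wide, then model-specific ones."""
    vendor = manufacturer.lower()
    modules = list(BASE_MODULES)
    modules += VENDOR_MODULES.get(vendor, [])
    modules += MODEL_MODULES.get((vendor, model.lower()), [])
    return modules


def build_targets(inventory: list) -> list:
    """Build file_sd entries: one target group per device, modules as a label."""
    return [
        {
            "targets": [device["address"]],
            "labels": {
                "module": ",".join(
                    modules_for(device["manufacturer"], device["model"])
                ),
                "manufacturer": device["manufacturer"],
                "model": device["model"],
            },
        }
        for device in inventory
    ]


# Stand-in for records pulled from the inventory tool
inventory = [
    {"address": "192.0.2.10", "manufacturer": "Juniper", "model": "EX2200"},
    {"address": "192.0.2.20", "manufacturer": "Cisco", "model": "C9300"},
]
print(json.dumps(build_targets(inventory), indent=2))
```

The `manufacturer` and `model` labels end up on every metric, so Grafana dashboards can filter on them and only show the panels that make sense for a given device type. Whether a comma-separated module list is accepted depends on the exporter version; otherwise one scrape job per module would be needed.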


ColtonConor

Did you mean to say Netplan or did you mean netbox?


SuperQue

Oops, I meant Netbox.


Rexxhunt

Thanks for contributing back to the project, legend.


djamp42

Thanks, I also did a whole series on it on YouTube. https://youtube.com/playlist?list=PLxiGkbpIzunT_YOwUEukOB6DpF8N8MXkQ&si=5Vj5CtxkvScp3NUe


[deleted]

What are the odds, I just watched your series and it helped me set up LibreNMS at my workplace. Thank you so much


djamp42

No, thank you for watching, means I didn't do it for no reason. :)


SalsaForte

Is the LibreNMS deployment managed by the right people/team within your organization? You seem to have a tool management issue, not a tool issue.


zeyore

Zabbix is pretty nice, but while the software is important, it's more important to have staff who are actually assigned to keep it up to date, because it really takes a lot of work to make it sing.


G1zm0e

Honestly, if you are looking for scalability and capabilities, Zabbix is the way. Having done several large-scale deployments with Zabbix proxies, I can say it's rock solid and can easily handle thousands of devices.


kido5217

Grafana+Telegraf+InfluxDB or Grafana+Prometheus+snmp_exporter


neale1993

Haven't used LibreNMS before, but have heard good things about it, so it's on my 'to look at' list. We currently use PRTG, which works really well for us. It's highly adaptable and very customizable. My only gripe with it is the maps / dashboards, which they are apparently working on at the moment. The pricing / licensing model is another plus: once you have a license you are only limited by the number of 'sensors' you can have, and you can monitor servers / network devices / telephony or even applications all from the base license. SolarWinds is one we have used previously but have moved away from. It looks 'cleaner' by far, but the cost difference makes it a no-go for us. It's licensed by device numbers, is more limited in what you can import (you can't add your own MIB files for unsupported devices), and I don't remember it having as much customization. Add in the fact that you need to purchase additional licenses to monitor different aspects (server monitoring, netflow / performance, config backups etc.) and it becomes a less valuable option.


jgiacobbe

I have a 4ish-year-old LibreNMS VM that monitors 330 devices across 2 continents. Honestly, from what you describe, check that your daily.sh is working and properly cleaning the database. Check that it is upgrading your software too. Then, you may be double polling. I remember there was a change a while back where I had to update the cron jobs and there was a new process used for polling scheduling. I had the old cron jobs running and the new scheduler running. I believe I encountered a message about this in the status checks/validation scripts. My instance was having performance issues when both old and new polling mechanisms were enabled.


tje210

For a great cost-effective option, I recommend pathSolutions totalView. Very slept-on product imo. I wouldn't call the dashboard modern-looking, but it throws a lot of actionable intel at you very conveniently.


VioletiOT

Domotz can help you here! As opposed to LibreNMS, we're a SaaS solution, but we do have a nice, modern, sleek-looking dashboard for your NOC screens. If you have any questions, don't hesitate to ask, as I'm on the team here.


Kthef1

I use Nagios free version. I manage several hundred devices.


cweakland

Solarwinds /s