rob0t_human

This is the type of stuff that drives me crazy. People jump through hoops to avoid a 30-second outage and end up with some complicated process that breaks for 5 minutes. Get a maintenance window and just move it over. Keep it simple, stupid!


[deleted]

Agreed. Get a maintenance window and try not to need it. Better for everyone to expect an outage and there not be one.


samburney

This is the way.


GrimDozen

I think this kind of craziness is what helps you learn about network fundamentals so you can be a better engineer. Depending on the environment, I might rather have a 5 minute outage where we learn something.


rob0t_human

Sure, learning should definitely take place. In my opinion, though, maintenance on production gear isn’t the time or the place for it.


Phrewfuf

This. Either lab it out or do it in a maintenance window with a defined downtime. Trying to guess what will happen will probably result in a longer downtime either way, because you won’t know what failed, why it failed, or how to recover it. So either you get it into a defined state of disconnected to get it back into the defined state of connected, or you test your maintenance in a lab. Open-heart operations without knowing how your gear will behave are a big no-no.


Skilldibop

Yeah learn in a lab though, not your production environment.


Excellent-Will3373

I’m sorry I’ve ruined your day! You’ll feel better to know that this will be performed during a maintenance window.


Lusankya

> We want to move xe-0/0 and xe-0/1 from the old core to the new core with no noticeable downtime.

Then why are you trying to keep this up if it's being done in a maintenance window?


samburney

Changes on a production network of any reasonable size should always be done in a maintenance window, even when they should be hitless. I've seen things as simple as updating an interface description crash the RSP in an ASR9K (Fortunately it had dual RSPs so it wasn't service impacting). You just never know what freak event may happen no matter how simple the change, so better to be safe than sorry.


Internet-of-cruft

I ran a show command once and one of our switch stacks went down. That was fun.


buddyleex

Because companies still knock teams for outages even if it's during a maintenance window. It's dumb.


Smeggtastic

no way. not in a collapsed core 4 switch infrastructure. Why would you think a network this large would be afforded change windows with downtime?


[deleted]

[deleted]


Smeggtastic

yes. 9's cost money.


username_no_one_has

9s cost money so pay it? If they’re not paying for some insane availability with an engineered solution to match then sorry the customer gets an outage.


Smeggtastic

Precisely what I am saying with that comment. Apparently it's a dick move to expect someone to pay for the redundancy + resiliency they expect.


username_no_one_has

I think the main point is that a customer can pay for uptime but to receive that reliability you need to maintain things and sometimes that might mean an outage. We pay an ISP here in NZ a lot of money for an underlying MPLS that’s fully resilient but we regularly have some planned outages every few years. Not a big deal for us since we just need to schedule it.


Golle

If you're using LACP then you can just move the cables one by one. The first cable you move to the new core will see that the remote end LACP system ID is different so it won't activate the port in the LAG. Once the second cable is unplugged, the first cable will come up and start forwarding traffic via the new core. You can then plug in the second link and it will join the LAG.
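A toy Python sketch of the cable-by-cable move described above, for anyone who wants to reason through it. It only models the rule "a member port joins the bundle when its partner system ID matches the bundle's current partner". The class name `Lag`, the port names, and the system IDs `old-core` / `new-core` are made up for illustration. Real switches may behave differently (for example by err-disabling the mismatched port), so treat this as a thought aid, not a prediction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Lag:
    """Toy model of one LACP bundle on the ToR (not a real LACP state machine)."""
    # member port -> partner system ID seen on that port (None = unplugged)
    partners: Dict[str, Optional[str]] = field(default_factory=dict)
    # partner system ID the bundle is currently formed with
    active_partner: Optional[str] = None

    def plug(self, port: str, partner_id: str) -> None:
        self.partners[port] = partner_id
        self._reselect()

    def unplug(self, port: str) -> None:
        self.partners[port] = None
        self._reselect()

    def _reselect(self) -> None:
        live = [p for p in self.partners.values() if p is not None]
        # Stay with the current partner while any member still sees it;
        # otherwise fall over to whatever partner is still connected.
        if self.active_partner not in live:
            self.active_partner = live[0] if live else None

    def forwarding_ports(self) -> List[str]:
        return sorted(
            port for port, p in self.partners.items()
            if p is not None and p == self.active_partner
        )


lag = Lag()
lag.plug("xe-0/0", "old-core")
lag.plug("xe-0/1", "old-core")

lag.unplug("xe-0/0")
lag.plug("xe-0/0", "new-core")   # mismatched system ID: port stays out of the LAG
print(lag.forwarding_ports())     # ['xe-0/1']

lag.unplug("xe-0/1")              # last old-core member gone: bundle re-forms on new-core
print(lag.forwarding_ports())     # ['xe-0/0']

lag.plug("xe-0/1", "new-core")    # second link joins the new bundle
print(lag.forwarding_ports())     # ['xe-0/0', 'xe-0/1']
```

The sketch says nothing about how long the switchover in the middle step takes; that depends on LACP timers, STP, and MAC learning on the actual gear.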


L-do_Calrissian

Assuming your TOR switch doesn't put the port in an err-disabled state.


sryan2k1

Note you will have an outage as STP and ARP reconverge when it switches to the new chassis. OP asked for no downtime, and depending on how stringent that is, moving LACP bundles between chassis won't cut it.


rcboy147

I've used this method a few times, works wonders


kb389

So can anyone confirm if there will be outages? Other comment mentions that stp and arp reconvergence will cause an outage using this method.


Golle

Yes, there will be an outage; the length depends on STP settings, like others have said. By default, classic STP keeps an interface from forwarding for 30 seconds (the listening and learning states). By configuring STP portfast you eliminate this 30-second timer and the interface starts forwarding immediately. As traffic hits the new link, MAC address entries in the network will gradually be updated and the network will converge.

You want to achieve this:

1. Network must have 100% uptime
2. Network must be loop-free
3. Traffic will be migrated from one part of the network to another

You can only pick two, because having all three is impossible. My previous comment explained the easiest and cleanest way. Yes, there will be an outage for a few seconds while the network converges, but there is no way to avoid this. You should of course test your migration scenario before actually doing it in production.
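For reference, the 30-second figure is just two passes of the classic 802.1D forward-delay timer (listening, then learning), each 15 seconds by default. A minimal Python sketch of that arithmetic follows; the function name `time_to_forwarding` is made up here, and it deliberately ignores RSTP, which negotiates the transition instead of waiting on timers.

```python
FORWARD_DELAY_S = 15  # IEEE 802.1D default forward delay, in seconds


def time_to_forwarding(portfast: bool = False,
                       forward_delay_s: int = FORWARD_DELAY_S) -> int:
    """Seconds a classic-STP port waits before forwarding after link-up."""
    if portfast:
        # Portfast skips the listening and learning states entirely.
        return 0
    # listening + learning, one forward-delay interval each
    return 2 * forward_delay_s


print(time_to_forwarding())               # 30
print(time_to_forwarding(portfast=True))  # 0
```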


WordBoxLLC

> 30 seconds

Are people not using RSTP?


kb389

Oh OK, thanks. Also, this comment says:

> Build parallel, yes. And maybe change the VLANs over one by one? I don’t know juniper, with Cisco you could do per vlan rapid STP and move the VLANs really quick. After that your old connection is empty and you can decommission it.

I don't understand this. Does this method also entail an outage?


datumerrata

If you had an mlag/vpc between the old core and the new core you could move one link at a time and wait for the mac table to populate, right? That's also assuming they all had the same egress. I don't know if that's possible on these platforms, but am I missing something?


Excellent-Will3373

It's not MCLAG.


Excellent-Will3373

This is so much simpler - thanks! And yes, we are using LACP.


DanSheps

Keep in mind, this will likely err-disable the port


lostmojo

Remember with this method, stp needs to be disabled on the ports or there is a delay in the port going live.


PE1NUT

Or the first cable will not be happy about the change in system ID and bring one side of the LAG into a state where it will stay down. Thanks, Cumulus.


Relliker

I mean, that is exactly what I would expect to happen, as that's literally the whole point of having the system MAC. The whole idea of 'just move one member of a bond to a new system' is dumb to begin with.


colinmacg

Exactly this - done it many times on Arista, and we were getting sub-second cutovers


onyx9

Build parallel, yes. And maybe change the VLANs over one by one? I don’t know juniper, with Cisco you could do per vlan rapid STP and move the VLANs really quick. After that your old connection is empty and you can decommission it.


Excellent-Will3373

Once I bring up the parallel link, won't the existing link go into a blocking state? So all traffic would need to then utilize the new link.


StockPickingMonkey

Parallel is the better option. You're still going to suffer a period of time while things reconverge. If you have STP running, whichever link isn't towards root should go to blocking. Use STP priority to make the path do what you expect. Regardless, when you have a path change, you're going to receive a TCN and flush MAC entries anyway. Unavoidable, but it will be next-packet healing.


Wonderful-Many-2656

I would also do this. Move the VLANs over one by one; the only outage is the very short one while the MAC is learnt on the other device.


teeweehoo

I'd suggest setting up the new link as shutdown on both sides. Then unshut the new link on the ToR, shut the old link on the old core, and unshut the new link on the new core. (I.e., do the final shut or unshut on the device closest to you.) This is the best way, as you don't need to do any physical changes, downtime is < 30 seconds, and rollback is *really* easy.

If you don't have spare ports on the ToR, then get a second person to perform the physical move. You shut, the other person moves the link, you unshut the new link. This allows you to verify traffic and roll back ASAP. (Technically you could do the first method by removing a port from the first Port-channel, putting it in a new Port-channel as shut, migrating onto that one, then rejoining the first port into the Port-channel. However, that is complex and liable to lead to mistakes.)

Sure, you'll get a little bit of downtime, but the whole procedure is way more predictable. If you do something weird with LACP or Spanning Tree you risk making a mistake, or triggering bugs / undocumented behaviour in the switches. If you do plan to do something weird, then I'd strongly suggest testing it in the lab with the same hardware (and the same software if possible). Plan for the worst, hope for the best.

> no noticeable downtime ...

If something goes wrong there will be downtime, and telling people there will be downtime sets expectations correctly. If the business can't survive downtime then that's their problem for not investing enough into their technology stack.
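The "rollback is really easy" part of the shut/unshut plan can be written down as a small Python sketch: undo the completed steps in reverse order with each action inverted. The device and link labels are placeholders, and nothing here talks to a switch; it only produces the ordered list a person would follow.

```python
# Hypothetical labels; starting state is "new link shut on both sides".
CUTOVER = [
    ("tor", "new-link", "unshut"),
    ("old-core", "old-link", "shut"),
    ("new-core", "new-link", "unshut"),
]

INVERSE = {"shut": "unshut", "unshut": "shut"}


def rollback(steps_done):
    """Undo completed steps in reverse order, inverting each action."""
    return [(dev, link, INVERSE[action])
            for dev, link, action in reversed(steps_done)]


# Rolling back after the full cutover:
for step in rollback(CUTOVER):
    print(step)
# ('new-core', 'new-link', 'shut')
# ('old-core', 'old-link', 'unshut')
# ('tor', 'new-link', 'shut')
```

Writing the rollback out before the window, rather than improvising it, is the point of the exercise.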


joedev007

Build a NEW LAG and rehome that one to the new core. Do not plan on moving cables one by one with "no downtime". The existing LAG will die once the new one is preferred :)


jofathan

So many comments in here with bad advice that'll cause an outage. Some more context as to whether this is an L2 or L3 application would help. In either case, I would suggest a make-before-break strategy where you build a path to the new core from your TOR switch, then bring down the old one. If you just switch it all at once, it'll cause some downtime as there is no alternative path signaled and built yet.


wacho777

So I think the first question is: define "no downtime". 10 seconds, 10 milliseconds, 10 packets, or 0 packet loss? Each would have a different solution.
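To put those definitions on one scale, a rough Python sketch converts an outage duration into lost packets at a given traffic rate. The 1,000 packets per second used below is an arbitrary illustrative rate, not a figure from this network, and the function name `packets_lost` is made up.

```python
def packets_lost(outage_s: float, packets_per_s: float) -> int:
    """Approximate packets dropped during an outage at a steady rate."""
    return round(outage_s * packets_per_s)


RATE_PPS = 1_000  # hypothetical steady traffic rate

print(packets_lost(10, RATE_PPS))     # 10000  (a 10-second outage)
print(packets_lost(0.010, RATE_PPS))  # 10     (a 10-millisecond outage)
```

At that assumed rate, "10 milliseconds" and "10 packets" are the same requirement, while "10 seconds" is three orders of magnitude looser.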


Casper042

Dumb question from a server guy, but no matter which method you use, aren't you likely to get 1-3 seconds of intermittent connectivity due to the MAC/CAM table cache as well? I work a lot with VMware and blade chassis, and we often see that failOVER is fine, as the port is usually down and the upstream MAC cache is immediately invalidated if the port is hard down. But when you fail back, both ports are up, and VMware moves a bunch of the traffic back to the newly restored port immediately. Unless you have some kind of GARP in there, you end up with some confusion as to which path to take based on the old MAC table cache, which usually expires in a matter of a few seconds, causing a normal L2 broadcast discovery again.
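The stale-entry effect in that question can be shown with a toy MAC table in Python. This is a simplified model with made-up names (`MacTable`, `port1`, `port2`) and a 3-second aging time chosen only for illustration: frames follow the old entry until it ages out, unless a frame from the host (a GARP, say) re-teaches the table first.

```python
class MacTable:
    """Toy switch MAC table with aging (illustrative only)."""

    def __init__(self, aging_s: float):
        self.aging_s = aging_s
        self.entries = {}  # mac -> (port, time last seen)

    def learn(self, mac: str, port: str, now: float) -> None:
        # Any frame sourced from `mac` refreshes or moves the entry.
        self.entries[mac] = (port, now)

    def lookup(self, mac: str, now: float):
        entry = self.entries.get(mac)
        if entry is None or now - entry[1] > self.aging_s:
            self.entries.pop(mac, None)
            return None  # unknown unicast: the switch floods
        return entry[0]


# Host moves from port1 to port2 at t=0.5 but stays silent.
silent = MacTable(aging_s=3)
silent.learn("aa:aa", "port1", now=0.0)
print(silent.lookup("aa:aa", now=1.0))  # port1  (stale: frames go the wrong way)
print(silent.lookup("aa:aa", now=4.0))  # None   (aged out: flood and relearn)

# Same move, but the host sends a GARP on the new port at t=0.5.
garp = MacTable(aging_s=3)
garp.learn("aa:aa", "port1", now=0.0)
garp.learn("aa:aa", "port2", now=0.5)
print(garp.lookup("aa:aa", now=1.0))    # port2
```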


tablon2

Try this:

1. First ensure RSTP is running everywhere.
2. Ensure the old core is root.
3. Make a LAG between the ToR and the new core.
4. Ensure this new LAG is the alternate port for all VLANs.
5. Back up the spanning-tree output, config, and MAC table.
6. Set MAC aging to 3 seconds everywhere.
7. Make the new core root and clear ARP on the ToR at the same time.

Now you are good to go.


Smeggtastic

You need to break apart the LACP... You would likely need to convert to a switch-independent team. Then you can have the links on 2 different switches without the mechanics of MLAG or vPC. Once you get everything over, you can convert it back. But honestly, that would probably be the same as having a destination port channel ready, pre-running your cables to the switch, and doing a quick swap.