this is all L2 from the switch's point of view.
Kinda true, but technically not. This is one of the many areas where the OSI model breaks down. Multicast addresses like 239.255.1.2 are clearly L3, but part of what IGMP snooping does is remember what it's seen at that level and maintain the MDB, which is based on (L2) Ethernet multicast addresses, not IP multicast addresses. Atop all that, there isn't a 1:1 mapping between the two.
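To make the "no 1:1 mapping" point concrete, here is a minimal Python sketch of the standard IPv4-to-Ethernet multicast mapping: the low 23 bits of the group address are copied into the fixed 01:00:5e prefix, so 32 different IP groups land on each MAC address. The function names are mine, purely for illustration, and this shows only the address arithmetic, not how any particular switch builds its MDB.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Return the Ethernet multicast MAC that an IPv4 group maps to."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Only the low 23 bits of the IP address survive the mapping.
    low23 = int(addr) & 0x7FFFFF
    mac = 0x01005E000000 | low23
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


def colliding_groups(group: str) -> list[str]:
    """List every IPv4 group that shares a MAC with the given group."""
    low23 = int(ipaddress.IPv4Address(group)) & 0x7FFFFF
    # The top 4 bits are fixed at 1110; the next 5 bits are lost in the mapping.
    return [
        str(ipaddress.IPv4Address(0xE0000000 | (i << 23) | low23))
        for i in range(32)
    ]


if __name__ == "__main__":
    print(multicast_mac("239.255.1.2"))   # 01:00:5e:7f:01:02
    print(multicast_mac("224.127.1.2"))   # same MAC, different group
    print(len(colliding_groups("239.255.1.2")))  # 32
```

So a table keyed on MAC addresses cannot always tell 239.255.1.2 apart from, say, 224.127.1.2, which is one reason the L2/L3 line is blurrier here than the OSI diagram suggests.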
I do believe I get what you're saying, but hear what I am saying in return: don't get caught up in L2 vs L3, and don't be needlessly exclusionary about it. The real world isn't nearly as simple as the ivory tower people designing X.25 would have wanted, had they gotten their way.
Interestingly: with only one switch everything works as expected even with "unknown multicast flood = no", so the problem is clearly the inter-switch link.
It is possible that you have discovered a bug in the multicast implementation. I recently put in an MT support request about the multicast docs, and the reply put it off on the grounds that they didn't want to do it until multicast was "fixed and fully functional in v7." So implicitly, it's currently broken and incomplete? Sad if true.
If you can boil your test case down to something a support engineer can try in a few minutes, reporting it might yield a solution faster than trying to hack around it with the help of forum people like me, who are not MT support.
PIM…kinda worked, but every hour joined groups were dropped for a few seconds or so.
That's encouraging. It tells me you're on the right track, and you merely need to find and fix the cause of that last dropout.
More, I suspect you know what the culprit is…
do I need to disable the multicast querier on the first switch
Your symptom does sound a lot like a misconfigured querier. Its whole job is to pinch off unwanted multicast streams. My only uncertainty comes from the fact that your reported "every hour" is much longer than typical querier timeouts. Those are more typically down in the 1-15 minute range, although I will concede that retries might multiply that by a factor of 3-ish. I can therefore believe 45 minutes, but 1 hour…? Only if you increased the initial querier timeouts into the 20-30 minute range.
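For a rough feel of the arithmetic, here is a small Python sketch using the Group Membership Interval formula from the IGMPv3 spec (RFC 3376): robustness × query interval + query response interval, with defaults of 2, 125 s, and 10 s. It is only a sketch under the assumption that the switches follow the RFC timers; I am not claiming RouterOS computes its timeouts this way, and the function name is made up for illustration.

```python
def group_membership_interval(
    query_interval_s: float = 125.0,
    robustness: int = 2,
    query_response_interval_s: float = 10.0,
) -> float:
    """RFC 3376 Group Membership Interval, in seconds.

    This is how long a group stays joined without a fresh report
    before the querier side considers it gone.
    """
    return robustness * query_interval_s + query_response_interval_s


if __name__ == "__main__":
    # RFC defaults: membership times out after a little over 4 minutes.
    print(group_membership_interval() / 60)

    # Query intervals that would put the timeout near one hour.
    for interval_min, robustness in [(30, 2), (20, 3)]:
        timeout_s = group_membership_interval(interval_min * 60, robustness)
        print(interval_min, robustness, round(timeout_s / 60, 1))
```

With the RFC defaults the timeout is 260 seconds. Getting to roughly an hour takes a query interval of about 30 minutes at robustness 2, or about 20 minutes at robustness 3, which is why an hourly dropout looks odd unless the timers were raised.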
One expedient option you might have is to increase the timeouts further. If it only has to work for one broadcast day, a dropout of a few seconds every night in the small hours might not matter.
Or, you could disable the querier entirely. That does mean that over time, streams will flood ports that haven't been interested in the stream for days, months, or years, but a periodic reboot of the switches would clear that out. With unknown multicast flood off, ports will start out quiescent and slowly get noisier until it's time to reset things again.
do I need to disable snooping on any of the switches?
No. To return to your flawed conception above, snooping is L2, PIM is L3.
As a rule, you want one querier per network, and one only. I think that means one on each side of the PIM boundary, but I confess that I'm more of a multicast user than a multicast engineer. I know how it's supposed to work when properly configured, but I have never managed a large multicast network myself, RouterOS or no.
Statistics: Posted by tangent — Sat Feb 17, 2024 8:20 am