Troubleshoot possible leakage
  •  
James Bobby

Messages: 35
Karma: -1
Hi,
I'm trying to troubleshoot what is possibly a memory leak or resource leak.

I am continuously monitoring network traffic, response times, etc. for a number of hosts behind a Kerio.

At times response times start to increase, then drop and increase again. After restarting Kerio, things return to normal. I've seen this a number of times, sometimes after a week, sometimes after a month.

I've been using the beta versions, lately the most recent one (reinstalled last week after the kernel became corrupt).

What can be done to try and pinpoint exactly what is causing this?

Attaching a graph of the ping response when this occurs. The red is when Kerio is restarted.

[Updated on: Tue, 30 October 2012 12:42]

  •  
silars

Messages: 429
Karma: 59
What does a throughput test look like?

This also appears to be something other than a memory leak. There are quite a few explanations for such behavior that could be considered normal.
  •  
James Bobby

Messages: 35
Karma: -1
Throughput is usually OK. On one occasion bandwidth was severely limited, causing VPN tunnels to time out; everything worked fine after a restart.

And yes, this might not be a memory leak, but it's something odd.

If I have 10 days of pretty static response times of, for example, 10 ms, and then they suddenly start to climb as in the graph shown, that is NOT normal.

And I can't blame a single server, since this happens on all monitors at the same time. That is why I suspect Kerio.

Maybe a pool of resources is getting overloaded and, once it is recycled, response times drop only to increase again. But since it doesn't happen from the start, it makes me wonder what may cause it.
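
To pin down when the climb begins instead of eyeballing the graph, the monitoring samples could be compared against the steady baseline. The following is only a rough Python sketch: the function name `find_drift_start`, the window sizes, the factor of 2 and the sample data are all assumptions made for illustration, not anything taken from the monitoring tool used here.

```python
from statistics import median

def find_drift_start(samples_ms, baseline_count=100, window=10, factor=2.0):
    """Return the index of the first window whose median exceeds
    factor x the baseline median, or None if that never happens."""
    if len(samples_ms) < baseline_count + window:
        return None
    # Baseline: median of the first samples, assumed to be the healthy period.
    baseline = median(samples_ms[:baseline_count])
    for start in range(baseline_count, len(samples_ms) - window + 1):
        # A median over a window ignores single slow replies.
        if median(samples_ms[start:start + window]) > factor * baseline:
            return start
    return None

# Hypothetical data: 200 steady samples near 10 ms, then a steady climb.
steady = [10.0 + (i % 3) * 0.2 for i in range(200)]
rising = [10.0 + i * 1.5 for i in range(1, 41)]
samples = steady + rising

print(find_drift_start(samples))
```

Knowing the sample where the drift starts makes it easier to line the event up with timestamps in the firewall's own logs.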
  •  
silars

Messages: 429
Karma: 59
Have you attempted any packet captures?
  •  
James Bobby

Messages: 35
Karma: -1
I have not tried to capture any packets yet as I am not sure what to look for. Except maybe packet sizes...

Attaching a new graph of what happened today.



Update: Right now it seems to be in a period of higher ping times again, and throughput is very, very bad.

Turning off Intrusion Prevention made the throughput go up, but still nowhere near what I can normally get.

[Updated on: Wed, 31 October 2012 12:41]

  •  
silars

Messages: 429
Karma: 59
Do you have an RTT tool other than a simple ping?

Ping is often handled differently by firewalls than other traffic, because attackers use ping to probe networks or to cause problems (smurf attacks). Ping is also fragile and low priority by design. You may be running into a defense mechanism that kicks in extra processing to monitor that activity. All of these reasons make ping a poor performance tool compared with the alternatives.

A better test would be to use a TCP- and UDP-based performance tool that measures throughput and response time. This is also where packet captures would help, preferably on both sides of the firewall: you could verify packet response times in lieu of an actual performance tool. If you find that your ping (specifically, ICMP echo) times are increasing but your application performance is unaffected, you may be chasing a red herring.
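
If no dedicated tool is at hand, one rough way to get a TCP-based response time is to time the TCP handshake itself. The Python below is a minimal sketch under that assumption, not a substitute for a real performance tool. The function names `tcp_connect_rtt_ms` and `sample_rtts` and the address in the usage comment are made up for illustration.

```python
import socket
import time

def tcp_connect_rtt_ms(host, port, timeout=2.0):
    """Time how long a TCP connection to host:port takes, in milliseconds.
    Pass an IP address so that DNS lookup time is not included."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0

def sample_rtts(host, port, count=5, interval=1.0):
    """Collect several connect times; None marks a timeout or refusal."""
    results = []
    for _ in range(count):
        try:
            results.append(tcp_connect_rtt_ms(host, port))
        except OSError:
            results.append(None)
        time.sleep(interval)
    return results

# Example (hypothetical target behind the firewall):
# print(sample_rtts("192.0.2.10", 443))
```

Running the same sampling from both sides of the firewall would show whether the extra delay is added by the firewall or exists on the path anyway.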

I do agree that you appear to be onto something undocumented. I can't find any mention in Kerio's documentation that Control should behave like this. However, if you want to be conclusive, you'll need more data to eliminate more possibilities.

Your other option is to contact Kerio's support directly. This is really a user-to-user forum. The Kerio folks certainly maintain a presence here, but I don't believe they have any legal obligation to do so.
  •  
James Bobby

Messages: 35
Karma: -1
Yeah, I also monitor TCP response times (attached) and they show the same behavior. The graph is not for the same destination as the ping graph (even though I monitor ping as well).

I am aware that this is a user-to-user forum rather than a support tool. I figured I would start here to see if anyone else has noticed the same thing or has ideas, since after all it's a beta :)

At the moment the throughput is around 8 Mbit/s both up and down.

  •  
silars

Messages: 429
Karma: 59
If TCP response is also matching ICMP response, then the last thing to monitor is the system health during such an event. Is the Control device seeing high CPU utilization or memory usage?

Do you have access to a non-beta version of Control? For comparison's sake.

I don't see many other paths for you to explore. It is good to see people still approach a beta properly, instead of assuming it is just free stuff. I definitely believe you have done your due diligence as a beta user. It is up to Kerio whether they want to continue exploring this behavior.

I'm guessing they'll want you to describe your actual Control device (hardware, OS, other software, etc.). If you want to share that information on these forums, we could probably continue to hash through some other possibilities. That's up to you and what you have on your plate.
  •  
James Bobby

Messages: 35
Karma: -1
Oddly enough, there's no change in CPU or memory when this occurs (apart from when turning off IPS/AV, since a bit of memory is freed up). I can't even get the CPU to climb over 40% when running speed tests at 500+ Mbit/s (when it's not acting up, of course), so there are no bottlenecks there at least.

Hardware-wise it's nothing special, but nothing really bad either: a mid-range motherboard with an i5-2400 3.1GHz, 4GB RAM and an SSD, running the latest beta of the software appliance.

The external interface is an Intel Gigabit NIC (which I will try to disable and re-enable to see if the problem lies there; I ruled out the internal interface because the monitors to Kerio itself show no increased response times).

I will see about switching disks and installing a stable release with the same configuration.

Thanks for the ideas and tips. It's also good to just throw out thoughts instead of being all alone looking at my weird graphs haha :)
  •  
James Bobby

Messages: 35
Karma: -1
I noticed something interesting this morning. I rebooted last night at 8pm, and around 8am this morning I began to see response times increase again.

After logging on to the shell and checking dmesg, I see the following:

[ 34.933197] kvnet!always_off(): called.
[43977.026310] irq 16: nobody cared (try booting with the "irqpoll" option)
[43977.026314] Pid: 0, comm: swapper/0 Tainted: G O 3.2.0-k1-kerio-686 #1
[43977.026316] Call Trace:
[43977.026321] [<c107120e>] ? __report_bad_irq+0x15/0x8d
[43977.026323] [<c1071215>] ? __report_bad_irq+0x1c/0x8d
[43977.026326] [<c10713ab>] ? note_interrupt+0x125/0x18f
[43977.026328] [<c106fdb9>] ? handle_irq_event_percpu+0x13f/0x151
[43977.026330] [<c1071847>] ? handle_edge_irq+0xa0/0xa0
[43977.026332] [<c106fdec>] ? handle_irq_event+0x21/0x39
[43977.026334] [<c1071847>] ? handle_edge_irq+0xa0/0xa0
[43977.026337] [<c10718b1>] ? handle_fasteoi_irq+0x6a/0xa6
[43977.026338] <IRQ> [<c1003119>] ? do_IRQ+0x2e/0x84
[43977.026342] [<c12b26f0>] ? common_interrupt+0x30/0x38
[43977.026345] [<c103007b>] ? allow_signal+0x22/0x5a
[43977.026347] [<c1191e77>] ? intel_idle+0xb5/0xdc
[43977.026350] [<c11fdfbc>] ? cpuidle_idle_call+0xca/0x166
[43977.026352] [<c1001bba>] ? cpu_idle+0x95/0xbb
[43977.026354] [<c13f07a4>] ? start_kernel+0x326/0x32b
[43977.026355] handlers:
[43977.026360] [<f83bdd21>] e1000_intr
[43977.026361] Disabling IRQ #16

The timestamp of almost 44000 seconds is about 12 hours after boot, which matches when the response times started to climb, so the two seem to be connected somehow.
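
The conversion from the dmesg timestamp to hours since boot can be checked with a few lines of Python. This is only a sketch: the helper names `parse_dmesg_line` and `first_unhandled_irq` are invented for illustration, and the sample lines are copied from the log above.

```python
import re

# dmesg lines look like "[43977.026310] message"; the number is seconds since boot.
DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

def parse_dmesg_line(line):
    """Return (seconds_since_boot, message), or None if the line has no timestamp."""
    match = DMESG_LINE.match(line)
    if not match:
        return None
    return float(match.group(1)), match.group(2)

def first_unhandled_irq(lines):
    """Return (irq_number, hours_since_boot) for the first 'nobody cared' line."""
    for line in lines:
        parsed = parse_dmesg_line(line)
        if parsed is None:
            continue
        seconds, message = parsed
        hit = re.match(r"irq (\d+): nobody cared", message)
        if hit:
            return int(hit.group(1)), seconds / 3600.0
    return None

log = [
    "[ 34.933197] kvnet!always_off(): called.",
    '[43977.026310] irq 16: nobody cared (try booting with the "irqpoll" option)',
    "[43977.026361] Disabling IRQ #16",
]
irq, hours = first_unhandled_irq(log)
print(f"IRQ {irq} went unhandled about {hours:.1f} hours after boot")
# prints: IRQ 16 went unhandled about 12.2 hours after boot
```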

So interrupts for IRQ 16:
16: 10900001 0 0 0 IO-APIC-fasteoi eth1

My external interface... It seems I am one step closer to the cause of my problems. Now I just need to figure out why my external interface doesn't respond to calls, and why it keeps working even though something is wrong (maybe the duplex switches, or the card is simply broken).
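
To watch whether the counter for IRQ 16 still moves while traffic is flowing, the /proc/interrupts row can be read twice and compared. The Python below is a sketch that assumes the row layout shown above (IRQ label, one numeric column per CPU, then controller and device name). The function names `parse_interrupt_line` and `total_delta` are hypothetical.

```python
def parse_interrupt_line(line):
    """Split one /proc/interrupts row into (irq, per_cpu_counts, description)."""
    label, rest = line.split(":", 1)
    fields = rest.split()
    counts = []
    # Leading numeric fields are the per-CPU counters.
    while fields and fields[0].isdigit():
        counts.append(int(fields.pop(0)))
    return label.strip(), counts, " ".join(fields)

def total_delta(before_line, after_line):
    """Interrupts handled between two snapshots of the same row, summed over CPUs."""
    _, before, _ = parse_interrupt_line(before_line)
    _, after, _ = parse_interrupt_line(after_line)
    return sum(after) - sum(before)

row = "16: 10900001 0 0 0 IO-APIC-fasteoi eth1"
irq_label, cpu_counts, description = parse_interrupt_line(row)
print(irq_label, cpu_counts, description)
```

A delta of zero between two snapshots taken while the interface is passing traffic would be consistent with the "Disabling IRQ #16" message in dmesg.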

[Updated on: Thu, 01 November 2012 10:19]

  •  
silars

Messages: 429
Karma: 59
Are your external and internal interfaces the same type of NIC?

You could also try swapping the two and seeing if the problem follows the NIC, or the configuration.
  •  
silars

Messages: 429
Karma: 59
Did you upgrade the VM hardware? What VM type? What are the VM adapter types (flexible, e1000, vmxnet2, etc.)?
  •  
James Bobby

Messages: 35
Karma: -1
No, they are different types, and so far I've not seen the behavior on the internal interface, only the external.

I've rebooted again to see if the same thing happens; however, I also modified the kernel boot options. I am now booting with the options: acpi=off noacpi pci=routeirq

That is to see if it makes any difference. At the same time, with the kernel ignoring any faults, it could potentially mess up the NIC completely instead of just slowing it down as before.
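
After a reboot it is worth confirming that the options actually reached the kernel, since a mistyped boot entry would silently invalidate the test. On Linux the active command line is exposed in /proc/cmdline. The Python below is a small sketch: the helper names and the sample command line are assumptions for illustration, and only the three options are taken from the post above.

```python
WANTED = ["acpi=off", "noacpi", "pci=routeirq"]

def missing_boot_options(cmdline, wanted):
    """Return the wanted options that do not appear on the kernel command line."""
    present = set(cmdline.split())
    return [option for option in wanted if option not in present]

def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()

# Hypothetical command line, used so the example runs anywhere.
sample = "root=/dev/sda1 ro acpi=off pci=routeirq"
print(missing_boot_options(sample, WANTED))
```

On the firewall itself, `missing_boot_options(read_cmdline(), WANTED)` would perform the same check against the running kernel.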

NIC runs on e1000, here's the info:
07:00.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)
Subsystem: Intel Corporation PRO/1000 GT Desktop Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 32 (63750ns min), Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at fe440000 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at fe420000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at d000 [size=64]
Expansion ROM at fe400000 [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [e4] PCI-X non-bridge device
Command: DPERE- ERO+ RBC=512 OST=1
Status: Dev=00:00.0 64bit- 133MHz- SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=8 RSCEM- 266MHz- 533MHz-
Kernel driver in use: e1000
  •  
silars

Messages: 429
Karma: 59
Checking vCenter, the VM type = 4 for my Control appliance. The adapters are configured as "Flexible" and show up as Lance adapters in Control. This was a 7.3 appliance that was upgraded to 7.4 yesterday.

I've been considering attempting to upgrade the VM hardware and adapters. I'll wait until you figure out your problem.

[Updated on: Fri, 02 November 2012 12:41]

James Bobby

Messages: 35
Karma: -1
silars, I'm not sure why we are discussing VMware now, since I am not using it. Maybe I gave the wrong impression somewhere; if so, my apologies. This is not a VM, it's a physical server.

However, so far it has been running well (with the new kernel options) for over a day.

I have found users of various Linux distributions reporting similar problems with cards that stop responding, filed as bugs. This could be similar, but I am waiting to see if it happens again after the boot option changes.