I did some analysis of the data that I captured overnight. There were a few blips that affected forum performance: a 3 minute period starting at about 1:43:30 am (CST) (A), a 3 minute period starting at about 5:27:30 am (CST) (B), and a 1 minute period at about 4:20:30 am (CST) (C).
(A) didn't affect general website performance (non-forum). (B) did. (C) did not.
There was a 30 second period at around 7:29 am that affected non-forum performance and had a very slight impact on forum performance.
There was a 3 minute period from about 3:04 am to 3:07 am that affected non-forum performance and had a slight impact on forum performance.
Having said all that, the above are not our issue. They are few and far between and are typical of nighttime batch jobs (database backups, OS level backups, other batch jobs that may affect network, CPU, etc.).
(A) and (B) were most likely something acting on the database or part of the storage infrastructure that holds the DB files. If I had to guess, I'd say (A) might have been an OS level backup, which included database files and general website files, and (B) might have been just a DB backup. (C) was too short to comment on. The other small blips that affected the non-forum stuff and had a slight impact on forum performance could have been disk *or* network, whereas (A) and (B) were most likely disk related. Those smaller blips definitely had a different flavor of impact than (A) and (B).
Anyway, like I said, the above are not our issue. The issue is that we have a forum system that performs poorly in certain pieces of functionality, and that functionality also impacts other parts of the forum system when it is used. There is a very strong tendency toward 10 second delays or multiples of 10 second delays (20 seconds, 30 seconds, etc.) when this functionality is used. In other words, this functionality takes 10 seconds and it blocks other requests for 10 seconds.
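One rough way to check the "multiples of 10 seconds" pattern against captured timings is to measure what share of request durations land close to 10, 20, 30 seconds and so on. The sketch below is only an illustration: the function name, the 0.5 second tolerance, and the sample durations are all made up for the example and are not taken from the captured data.

```python
def fraction_near_multiple(durations_s, period_s=10.0, tolerance_s=0.5):
    """Share of durations within tolerance_s of a positive multiple of period_s."""
    if not durations_s:
        return 0.0
    hits = 0
    for d in durations_s:
        # Nearest multiple of the period (10, 20, 30, ...); 0 does not count.
        nearest = round(d / period_s) * period_s
        if nearest >= period_s and abs(d - nearest) <= tolerance_s:
            hits += 1
    return hits / len(durations_s)


if __name__ == "__main__":
    # Hypothetical durations in seconds, NOT real measurements.
    sample = [10.1, 20.3, 0.4, 9.8, 30.0, 3.2]
    print(fraction_near_multiple(sample))
```

If the slow requests were simply slow queries, the durations should be spread out and this share should be small. A share close to 1.0 among the slow requests would point toward a fixed timer, polling interval, or timeout.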
Initially, I was willing to believe that it was coincidence and that it just took 10 seconds for the queries to run. Now I'm thinking it is either some sort of polling or a timeout situation. I'd almost be willing to believe that it could be something in the IP Board application layer. The user profile functionality *acts* differently than other parts of the product. You'll notice that a box pops up with the text "Loading Content..." on it, with an animation of a circle with arrows rotating around the outside.
This tells me that the IPB people either (a) knew that there were these delays, so they put up the box to let people know that the application was still functioning, or (b) knew that profile related page builds could be slow (due to the multimedia content -- images, videos, etc.), so they implemented the box to let people know that the application was still functioning. If (a), then we need to contact the cryonics companies to revoke any possible IPB developer policies. Heh, I'm only kidding... sort of.

If (b), then it is possible that the box mechanism itself may have accidentally *introduced* the fixed latency we are seeing. They could have even *purposefully* throttled back the performance of the profile functionality, with the intention of keeping that type of activity from affecting other parts of the application, since, again, the profile functionality is generally higher I/O than text retrieval from the forums. If so, then it could be that they accidentally introduced a blocking mechanism, thus doing the opposite of their original intent. *Or*, it could be that we have our pieces of the system (web server, app server, db server) configured in such a way that makes their box functionality block other sessions.
My current money is on programmed latency in the IPB layer. Either that, or a combination of that and database table locks. For example, while the session accessing the profile functionality is waiting, it could be holding a lock on a database table, and that lock, rather than the application level profile functionality itself, could be what is delaying the other forum queries. I'm switching over to the IPB layer due to the consistency of the delays. I find it more likely that there would be a 10 second programmed delay in the web/application tiers than in the DB tier. If we were seeing variable performance delays, then I'd lean more toward a resource scaling/contention issue.
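The lock scenario can be illustrated with a toy model: one "profile" request holds a lock for a fixed time, and an unrelated "forum" request that needs the same lock ends up waiting for roughly that same fixed time. This is only a sketch of the idea, not IPB code. A `threading.Lock` stands in for a database table lock, the names are invented, and the delay is scaled down from 10 seconds so it runs quickly.

```python
import threading
import time


def simulate_blocked_request(delay_s=0.3):
    """Return how long a forum request waits behind a profile request's lock."""
    table_lock = threading.Lock()   # stand-in for a database table lock
    holding = threading.Event()     # set once the profile request has the lock
    timings = {}

    def profile_request():
        with table_lock:
            holding.set()
            time.sleep(delay_s)     # stand-in for the fixed (programmed) delay

    def forum_request():
        holding.wait()              # arrive while the lock is already held
        start = time.monotonic()
        with table_lock:            # blocks until the profile request finishes
            pass
        timings["forum_wait"] = time.monotonic() - start

    threads = [
        threading.Thread(target=profile_request),
        threading.Thread(target=forum_request),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return timings["forum_wait"]


if __name__ == "__main__":
    print(f"forum request waited about {simulate_blocked_request():.2f}s")
```

The forum request does no slow work of its own, yet its wait tracks the fixed delay of the other session. That is the same shape as the behavior described above, where unrelated forum requests pick up delays that match the profile functionality's delay.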
The other thing I did was hunt around for other sites that use IPB. I found a few that did have profile functionality and were open to the public (no login required) and those sites had good performance (less than a second per tab click in the user profile). Some of them did have the box that popped up too. I mention this, because at least there is hope. They may have been running on a completely different operating system or using a completely different web/app/DB server, but at least it is possible for that functionality to perform well.
David
Edited by davidd, 24 February 2009 - 05:25 PM.