Technical Issues

Started by Mind, Nov 09 2008 09:37 PM

This topic is locked

87 replies to this topic

#1 Mind

Life Member, Director, Moderator, Treasurer
19,822 posts
2,000 ₮

Location:Wausau, WI
✔

Posted 09 November 2008 - 09:37 PM

If anyone has expertise with the following software and/or functions of the Imminst website, please let us know. I have already contacted helldesign about these problems to see which specific problem(s) they might be able to fix, and I am waiting on a reply. I don't know whether they will be able to fix all, some, or none of them. So if anyone out there has knowledge of IPB, MySQL, Drupal, or IPBwiki, please let us know how you might be able to assist. If you are a programmer, great, but even if you only have some slight knowledge about some of these issues, it would be helpful to know if there are easy solutions. If someone has the time, perhaps you can delve into some of the help forums that exist for these software products.

1. Slow loading forums. The Imminst forums are relatively large, but I have seen much larger forums operating much quicker elsewhere on the Internet. At one point in the past an Imminst director said it had something to do with the way MySQL is set up to do searches in the database and that there was some way to do faster searches. I am not an expert, so I don't know the exact terminology. It isn't the worst problem in the world, but speeding up the forums would sure help with the user experience.

2. The IPB mass email system has problems. When I send out a mass email, usually only 1,000 out of 7,000 emails go out. Also, the email notification system only works intermittently. This is where people ask to have the forum software notify them by email when someone replies to their topic in the forums. Also, it seems some new users do not get their validation emails and have trouble registering.

3. We tried to install a new wiki program. It didn't work and we are unsure why. It seems there might have been two wiki installations (one called "wiki" and another called "newwiki") and perhaps they were causing conflicts. We are now unable to edit the wiki for some reason.

4. The Drupal installation is causing errors. This is something Richard Leis could explain in greater detail. It has something to do with links to Drupal pages going bad or pointing to the wrong page.

#2 Brainbox

Guest
2,860 posts
743 ₮

Location:Middle of nowhere
NO

Posted 11 November 2008 - 06:48 PM

Sorry, I don't have the skill (and stopped "active duty" as programmer some time ago).

#3 lightowl

Guest, F@H
767 posts
5 ₮

Location:Copenhagen, Denmark

Posted 11 November 2008 - 07:21 PM

I can help out with MySQL query/schema optimization. At some point it's not possible to optimize any further, though. At that point you would need to think about scaling to multiple machines. I can help with that too if needed.

The email issue is a tricky one. Most often those are issues with aggressive spam filters blocking the messages. It might be a good idea to implement SPF and reverse DNS on the MX IP to be more spam filter friendly.
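To make the SPF suggestion a little more concrete, here is a minimal Python sketch that splits an SPF TXT record into its terms so you can see which senders it authorizes. The record string and IP address are made-up examples (not ImmInst's actual DNS data), and `parse_spf` is a hypothetical helper written for this post, not part of any mail library. It treats every term after the version tag as a mechanism, which is a simplification.

```python
def parse_spf(txt_record):
    """Split an SPF TXT record into (qualifier, term) pairs.

    Returns None if the record is not an SPF version 1 record.
    Simplification: modifiers such as "redirect=" are not treated specially.
    """
    terms = txt_record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return None
    parsed = []
    for term in terms[1:]:
        # A leading +, -, ~ or ? is the qualifier; "+" (pass) is the default.
        if term[0] in "+-~?":
            parsed.append((term[0], term[1:]))
        else:
            parsed.append(("+", term))
    return parsed


# Hypothetical record: allow the domain's MX hosts and one IP, reject the rest.
example = "v=spf1 mx ip4:192.0.2.10 -all"
print(parse_spf(example))
# [('+', 'mx'), ('+', 'ip4:192.0.2.10'), ('-', 'all')]
```

A record ending in `-all` tells receiving servers to reject mail from any sender not listed, while `~all` only marks it as suspicious.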

#4 Richard Leis

Guest
866 posts
0 ₮

Location:Tucson, Arizona

Posted 11 November 2008 - 07:22 PM

Regarding 3: we cannot log into the wiki because I turned the forum/wiki bridge off. When I turn the bridge back on, it slows down the entire site. IpbWiki's response was:

it's odd that you mention on the forum and on the wiki side, because on the forum side there's no extra load because of IpbWiki (all extra functionality is handled on the wiki side).

To test the free version of IpbWiki again, it's just a matter of overwriting the existing files and running the 2 setups.

There are 2 options which greatly change the speed on the wiki side, that is if you would disable page or parser cache (or enable the rating system) in the ipbwiki general settings, it would render the wiki slower.

To increase the speed of the wiki this can be done by installing a php caching system such as eaccelerator, APC, Turck mmcache, ... (usually in a shared hosting environment you don't have the ability to do this yourself and you have to ask your host to do this for you)

I am planning downtime for the weekend of November 22 to try to revert IpbWiki to the old version and to do upgrades and other maintenance.

#5 Lazarus Long

Life Member, Guardian
8,116 posts
242 ₮

Location:Northern, Western Hemisphere of Earth, Usually of late, New York
✔

Posted 11 November 2008 - 08:18 PM

Rich, Lightowl has expressed a willingness to work on this matter and other software-related concerns with us, so please figure out how to make it so.

#6 lightowl

Guest, F@H
767 posts
5 ₮

Location:Copenhagen, Denmark

Posted 11 November 2008 - 08:28 PM

These diagnostics reveal that there is no reverse DNS record for the MX IP of ImmInst.org. This could certainly be the cause of some of the email not being delivered correctly. You should contact your ISP to make sure the result of the reverse DNS lookup is "mail.imminst.org". This will convince the receiving mail servers that the messages are being delivered from a legitimate and correct mail server.
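For anyone who wants to check this without a web tool, a small Python sketch along these lines should do it, using only the standard library. The IP address in the example is a documentation placeholder, not the real MX address, and `reverse_dns_matches` is a hypothetical helper. The `lookup` parameter is there only so the check can be tried without network access.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Return the DNS name that is queried for a reverse (PTR) lookup of ip."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_dns_matches(ip, expected_host, lookup=socket.gethostbyaddr):
    """True if the reverse lookup for ip resolves to expected_host."""
    try:
        hostname, _aliases, _addresses = lookup(ip)
    except OSError:
        # No PTR record, or the lookup failed: treat as a mismatch.
        return False
    return hostname.rstrip(".").lower() == expected_host.lower()


# 192.0.2.25 is a placeholder address reserved for documentation.
print(ptr_name("192.0.2.25"))  # 25.2.0.192.in-addr.arpa
```

Calling `reverse_dns_matches(mx_ip, "mail.imminst.org")` with the real MX address would then show whether the ISP has set up the PTR record as described.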

http://www.mxtoolbox...OST=imminst.org

#7 Mind

Topic Starter
Life Member, Director, Moderator, Treasurer
19,822 posts
2,000 ₮

Location:Wausau, WI
✔

Posted 30 December 2008 - 07:54 PM

Seems the Imminst forum pages have been loading very slowly for me in the latter half of this year. In November I was getting timed out a couple times a week, and just last week I got a "flood control warning" a couple times.

If I had to make an estimation, I would say a third of the time it takes at least 5 seconds and closer to 10 seconds to refresh a page or access the forums, a third of the time it is in the 3 to 5 second range, and a third of the time it is fast enough and not bothersome - less than 3 seconds. I would like to know if other people experience this problem/nuisance here at Imminst and/or if you have noticed it at other IPB forums on the net.

Lightowl looked into the CPU usage and found it never went over 45% so it is unlikely we are using up all of the server resources. Any ideas? Anyone know of any optimization techniques for IPB or MySQL?

#8 FunkOdyssey

Guest
3,443 posts
166 ₮

Location:Manchester, CT USA

Posted 30 December 2008 - 08:06 PM

Seems the Imminst forum pages have been loading very slowly for me in the latter half of this year. In November I was getting timed out a couple times a week, and just last week I got a "flood control warning" a couple times.

If I had to make an estimation, I would say a third of the time it takes at least 5 seconds and closer to 10 seconds to refresh a page or access the forums, a third of the time it is in the 3 to 5 second range, and a third of the time it is fast enough and not bothersome - less than 3 seconds. I would like to know if other people experience this problem/nuisance here at Imminst and/or if you have noticed it at other IPB forums on the net.

Lightowl looked into the CPU usage and found it never went over 45% so it is unlikely we are using up all of the server resources. Any ideas? Anyone know of any optimization techniques for IPB or MySQL?

Is this a dual-core CPU? Because if the forum serving application is single-threaded, using 45% of a dual-core CPU means you are nearly maxing out one of the cores.
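The arithmetic behind that, as a tiny Python sketch. It assumes the worst case where all of the load lands on a single core, which is only true if the serving process really is single-threaded. The function name is made up for illustration.

```python
def busiest_core_upper_bound(total_utilization, cores):
    """Worst-case utilization of one core if all load sits on a single core.

    total_utilization is the machine-wide average as a fraction (0.45 = 45%).
    """
    return min(1.0, total_utilization * cores)


print(busiest_core_upper_bound(0.45, 2))  # 0.9
```

So a reported 45% on two cores could hide one core running at up to 90%.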

#9 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 28 January 2009 - 05:42 AM

Yes, I've noticed the long page load delays. I'm ready and willing to assist in tuning this thing. I posted in a couple places, but I'm not sure who to contact to get the ball rolling.

I tune large systems for a living. I'm willing to donate my time for free.

Some questions I posted elsewhere, with some extras thrown in:

...
Are we using MySQL, SQL Server or Oracle?

If MySQL, are we using the InnoDB or MyISAM engine?

Is the database server on a different machine than the web server(s)?

Is the storage subsystem just a hard drive (or hard drives) on the server or is it a SAN/NAS architecture?

RAID? (5, 1+0, ?)

Do we have load balanced servers, or just one server?

Are we running on Unix or Windows?

What type of processor(s) do we have?

How much memory?

Are we sharing the machine(s) with any other websites/processing?
...

FunkOdyssey has a good point. Typically, however, PHP will either run in multi-threaded mode (important if we are on Windows) or multi-process mode, so we should be able to use multiple cores/processors. MySQL (and the other databases if we are using one of them) is multi-threaded too. I suppose it is always possible that something has been configured to limit how many threads/processes are being used and if that is less than the number of cores/processors, then we could experience what FunkOdyssey is mentioning.

Without looking into this at all and just going by the CPU data given in the previous posts and assuming we are running the web and database servers on one machine, my best guess is that we have an I/O bottleneck. This can happen for a number of reasons (in no particular order):

1) We may just not have enough I/O capacity. If we are running off a single hard drive, this could be an issue.

2) Related to (1), if we are in a SAN environment, we may not have the SAN and/or file systems laid out in an optimal way.

3) We may not have MySQL (or other DBs) tuned properly to make the most of our I/O device(s).

4) The SQL may not be tuned properly.

5) The database may need some maintenance performed on it. The fact that the performance has gotten gradually worse over time (and assuming nothing else was installed on the system and that our user load hasn't increased dramatically) would generally point to this as the place where I would start if I had access. Don't worry -- even multi-billion dollar corporations with many full-time DBAs seem to have a problem with this issue.

I'm guessing we have fairly modest hardware. This is probably okay, because we most likely don't have a very high user load and we don't have a *ton* of data. Although I guess some of that depends on the data model for IP.Board and how they handle images and other binary data that people may include in posts on here.

It probably just needs a good cleaning and a bit of tweaking to operate much faster than it is now. After that point, when we get into SQL tuning and/or tweaking the database schema, we can expect some user operations to improve and feel "snappier." And then there is PHP and web server tweaking too, once our CPU starts to become the bottleneck. There is definitely a point of diminishing returns, and there are hardware limitations somewhere along the line as well, if our user load increases.

The good news is that we should be able to support a *lot* of users on relatively inexpensive hardware, these days, provided IP.Board has a scalable design. I mean, our users must have a relatively *huge* think time (time spent processing the data they are given from the system instead of triggering the next event in the system) compared to a lot of systems. Most of their time (on the forums anyway) is spent reading through many posts and, to a lesser extent, typing replies. They are only hitting the server(s) when going to the next batch of posts or submitting their own posts or searching the forums to start their reading.
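To put a rough number on the think time argument, the interactive response time law from queueing theory says the request rate is X = N / (R + Z), where N is the number of active users, R the response time, and Z the think time. A minimal Python sketch follows. The user counts and times are invented for illustration and are not measurements from this site.

```python
def request_rate(active_users, think_time_s, response_time_s):
    """Interactive response time law: X = N / (R + Z), in requests per second."""
    return active_users / (response_time_s + think_time_s)


# Hypothetical figures: 200 people browsing, about a minute of reading per page.
rate = request_rate(active_users=200, think_time_s=59.0, response_time_s=1.0)
print("%.2f requests per second" % rate)
```

With a long think time, even a few hundred simultaneous readers only generate a handful of requests per second, which is why modest hardware can be enough.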

David

#10 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 28 January 2009 - 04:06 PM

These diagnostics reveal that there is no reverse DNS record for the MX IP of ImmInst.org. This could certainly be the cause of some of the email not being delivered correctly. You should contact your ISP to make sure the result of the reverse DNS lookup is "mail.imminst.org". This will convince the receiving mail servers that the messages are being delivered from a legitimate and correct mail server.

http://www.mxtoolbox...OST=imminst.org

I see that we didn't pass the open relay check ("WARNING! Your server could be an open relay."). Has anyone looked into this? As for the performance issues, it is possible we're being used to funnel spam, which would bog down the system.

David

#11 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 29 January 2009 - 02:05 AM

Is it just me, or did the site just get much, much, much faster today? Did someone do some maintenance?

David

Edited by davidd, 29 January 2009 - 02:05 AM.

#12 Shepard

Member, Director, Moderator
6,360 posts
932 ₮

Location:Auburn, AL

Posted 13 February 2009 - 07:08 PM

The site has been slowing down again for me, and was apparently down at one point yesterday.

#13 Mind

Topic Starter
Life Member, Director, Moderator, Treasurer
19,822 posts
2,000 ₮

Location:Wausau, WI
✔

Posted 13 February 2009 - 11:05 PM

Thanks for your insights David. Will definitely get this on the to do list.

#14 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 14 February 2009 - 05:46 AM

The site has been slowing down again for me, and was apparently down at one point yesterday.

My offer still stands to volunteer my time to tune it. I've noticed it fast at some times and slow at other times. It makes me think either multiple systems are being served by the same hardware, or it just isn't scaling for some reason when a lot of people are doing searches on here.

David

#15 maxwatt

Guest, Moderator LeadNavigator
4,953 posts
1,627 ₮

Location:New York

Posted 20 February 2009 - 07:08 PM

I used to do database work with Sybase (a relational database somewhat similar to MySQL) and suspect the slow loading could be related to improper index structure.

#16 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 20 February 2009 - 07:53 PM

I used to do database work with Sybase (a relational database somewhat similar to MySQL) and suspect the slow loading could be related to improper index structure.

Good guess. It seems that the slowness is variable in nature (fast some days/hours, slow other times). One of my guesses has been that it may be doing a table scan instead of an index lookup, and when we get enough concurrent table scans going on, we max out the storage device (which is probably just a local hard drive on a PC; I haven't been able to get any answers on the infrastructure yet). Of course, every time I think of this, I think, "No, they wouldn't release a product that doesn't have proper indexes in place, would they?"
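Here is a small Python sketch of the table scan versus index lookup difference. It uses SQLite from the standard library as a stand-in, because nobody here has access to the real MySQL instance. The `posts` table, its columns, and the index name are all hypothetical and are not IP.Board's actual schema. On MySQL the equivalent check would be running `EXPLAIN` on the slow query.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan descriptions SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column is the description


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, topic_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO posts (topic_id, body) VALUES (?, ?)",
    [(i % 100, "post %d" % i) for i in range(1000)],
)

sql = "SELECT id, body FROM posts WHERE topic_id = ?"

# Without an index on topic_id the planner has to read every row.
plan_before = query_plan(conn, sql, (42,))

# With the index it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_posts_topic ON posts (topic_id)")
plan_after = query_plan(conn, sql, (42,))

print(plan_before)
print(plan_after)
```

The first plan should mention a scan of `posts` and the second a search using `idx_posts_topic`. The exact wording varies between SQLite versions.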

My other guess is that we may have some sort of fragmentation happening (either in the DB layer or in the OS layer). I'd be a bit surprised to see it at the DB layer, because I am guessing we don't have a lot of deletes happening. However, we could be seeing it at the OS layer, depending on whether the DB is set to autogrow the containers, thus increasing the chance of reduced contiguity of sectors on the disk.

Either way, I'd be willing to bet money on it being an I/O problem, not a CPU issue.

We'll never know unless someone is given the go-ahead to tune the thing.

(hint, hint)

David

#17 Mind

Topic Starter
Life Member, Director, Moderator, Treasurer
19,822 posts
2,000 ₮

Location:Wausau, WI
✔

Posted 20 February 2009 - 10:43 PM

Lightowl checked the CPU/Server usage and found it never got above 45%, although he could explain it in better detail.

#18 maestro949

Guest
2,350 posts
4 ₮

Location:Rhode Island, USA

Posted 20 February 2009 - 11:10 PM

My other guess is that we may have some sort of fragmentation happening (either in the DB layer or in the OS layer). I'd be a bit surprised to see it at the DB layer, because I am guessing we don't have a lot of deletes happening. However, we could be seeing it at the OS layer, depending on whether the DB is set to autogrow the containers, thus increasing the chance of reduced contiguity of sectors on the disk.

If this were the case you'd likely see a more consistent pattern of performance degradation rather than the sporadic spikes as described above.

Either way, I'd be willing to bet money on it being an I/O problem, not a CPU issue.

Agreed. I'm guessing it's an ISP congestion issue or a disk I/O bottleneck. Canaca should be able to rule out either if you ask them to monitor it, assuming they are willing to. But be prepared, as ISPs are notorious for not owning up to issues and forcing you to upgrade to their more expensive packages, but I digress....

I use Drupal @ Canaca as well and have noticed that my low-traffic site also has fluctuations in response time, sometimes at exactly the same time as ImmInst, thus I think that Canaca might have some bottlenecking going on in their network or for a particular segment of servers.

One "poor man's" way to rule out database query / DB disk I/O issues is to log into the database and execute some substantial queries while ImmInst is having performance issues. If they execute quickly, it's an indicator of poor service delivery from Canaca, i.e. their tubes can't handle the internet traffic, rather than a database or disk I/O issue.
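A minimal sketch of that timing check in Python, assuming a DB-API style connection. The demo below runs against an in-memory SQLite database with a made-up table purely so the snippet is runnable. Against the real server you would pass a MySQL connection instead, where the parameter placeholder is typically `%s` rather than `?`.

```python
import sqlite3
import time


def time_query(conn, sql, params=()):
    """Run a query on a DB-API connection; return (elapsed_seconds, row_count)."""
    start = time.perf_counter()
    cursor = conn.cursor()
    cursor.execute(sql, params)
    rows = cursor.fetchall()  # fetch everything so the read cost is included
    return time.perf_counter() - start, len(rows)


# Stand-in database with a hypothetical table, for demonstration only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT)")
demo.executemany(
    "INSERT INTO members (name) VALUES (?)",
    [("member%d" % i,) for i in range(5000)],
)

elapsed, count = time_query(demo, "SELECT * FROM members WHERE name LIKE ?", ("%9",))
print("%d rows in %.4f s" % (count, elapsed))
```

If the same query is fast when run directly on the server but the page that uses it is slow, that points away from the database and toward the network or the web tier.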

#19 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 22 February 2009 - 05:47 AM

My other guess is that we may have some sort of fragmentation happening (either in the DB layer or in the OS layer). I'd be a bit surprised to see it at the DB layer, because I am guessing we don't have a lot of deletes happening. However, we could be seeing it at the OS layer, depending on whether the DB is set to autogrow the containers, thus increasing the chance of reduced contiguity of sectors on the disk.

If this were the case you'd likely see a more consistent pattern of performance degradation rather than the sporadic spikes as described above.

Not necessarily. The extra I/O involved may only start to become a problem under higher load. When there are fewer users on the system (or fewer high I/O operations), there may be enough capacity in the storage subsystem to mask the issue.

Having said that, I'd still lean toward concurrent table scans as a more likely candidate. Hard to say though without looking at the SQL running against the DB.

Either way, I'd be willing to bet money on it being an I/O problem, not a CPU issue.

Agreed. I'm guessing it's an ISP congestion issue or a disk I/O bottleneck. Canaca should be able to rule out either if you ask them to monitor it, assuming they are willing to. But be prepared, as ISPs are notorious for not owning up to issues and forcing you to upgrade to their more expensive packages, but I digress....

I use Drupal @ Canaca as well and have noticed that my low-traffic site also has fluctuations in response time, sometimes at exactly the same time as ImmInst, thus I think that Canaca might have some bottlenecking going on in their network or for a particular segment of servers.

That's very interesting data. There could be some batch jobs that they are running that are competing for resources. These batch jobs *could* be something as simple as backups running. Have you found any particular time periods when these coincidental degradations occur?

One "poor man's" way to rule out database query / DB disk I/O issues is to log into the database and execute some substantial queries while ImmInst is having performance issues. If they execute quickly, it's an indicator of poor service delivery from Canaca, i.e. their tubes can't handle the internet traffic, rather than a database or disk I/O issue.

Yep, that would be one way to do it. You'd want the queries to hit the same tables that the application is having issues with at the time, however, to be sure, just in case there is a complex storage layout. If everything is on one disk, then it isn't as important to do this.

I'd still like to know what our storage infrastructure looks like. If it is just a local hard drive on a PC and it is not being shared with other processing, that's very different from some sort of shared storage device, possibly being accessed across a shared networking mechanism (whether that be fibre channel or ethernet).

Shared hardware is the bane of consistent performance.

Of course, without anyone being allowed to get in and run tests, we're just left guessing.

David

#20 caliban

Admin, Advisor, Director
9,163 posts
609 ₮

Location:UK
✔

Posted 22 February 2009 - 08:27 PM

Of course, without anyone being allowed to get in and run tests, we're just left guessing.

Guys, we appreciate all the input and we are committed to addressing technical questions and making the forums run more smoothly.
Lightowl has very very deep access to the software, and if it is useful, I'm sure that we could include other members. On the other hand, I'm sure you'll appreciate that we can't give technical access to too many people. Maybe you could co-ordinate suggestions with lightowl or come up with some other ideas on how to resolve this matter.

#21 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 23 February 2009 - 01:50 AM

Of course, without anyone being allowed to get in and run tests, we're just left guessing.

Guys, we appreciate all the input and we are committed to addressing technical questions and making the forums run more smoothly.
Lightowl has very very deep access to the software, and if it is useful, I'm sure that we could include other members. On the other hand, I'm sure you'll appreciate that we can't give technical access to too many people. Maybe you could co-ordinate suggestions with lightowl or come up with some other ideas on how to resolve this matter.

Unfortunately debugging/performance tuning in that manner isn't usually very fruitful. I'm glad someone is looking into it, as it will hopefully serve to draw in more members if the site runs better. How long has Lightowl been working on the performance issues?

I can understand that you couldn't give access to dozens of people, but I'm not too sure why there would be reluctance to let a few people give it a go. If you are worried about access to data, I can assure you that most of us who do this work for a living have access to far more sensitive data than what the Immortality Institute has to offer.

David

#22 maestro949

Guest
2,350 posts
4 ₮

Location:Rhode Island, USA

Posted 23 February 2009 - 12:30 PM

One thing we need is data. We need to know when / how often the issues are occurring. One thing we could do is write a client side script that simply does an HTTP GET/POST that loads a page that does a message board database query every 3 seconds, tracks the millisecond response time and logs it to a file. Let that run for a couple of weeks and then see if there are any patterns. From there, troubleshooting gets much easier, as does working with Canaca support.
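A rough sketch of such a script in Python, standard library only. The URL in the usage note is a placeholder, and the CSV layout (timestamp, milliseconds, status) is just one possible choice. The `fetcher` and `sleep` parameters exist only so the loop can be tried out without hitting the real site.

```python
import time
import urllib.request


def fetch(url, timeout=60):
    """Download the page body; network failures raise OSError subclasses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def sample(url, fetcher=fetch):
    """Return (epoch_seconds, elapsed_ms, status) for one request."""
    started = time.time()
    t0 = time.perf_counter()
    try:
        fetcher(url)
        status = "ok"
    except OSError as exc:
        status = "error:" + type(exc).__name__
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    return started, elapsed_ms, status


def monitor(url, log, interval_s=3.0, samples=None, fetcher=fetch, sleep=time.sleep):
    """Write one CSV line per request to log; samples=None runs until interrupted."""
    count = 0
    while samples is None or count < samples:
        started, elapsed_ms, status = sample(url, fetcher)
        log.write("%.0f,%.1f,%s\n" % (started, elapsed_ms, status))
        log.flush()  # keep the file current in case the script is killed
        count += 1
        sleep(interval_s)
```

Usage would be something like `with open("response_times.csv", "a") as log: monitor("http://example.org/forum/", log)`, left running for a couple of weeks. Note that polling every 3 seconds adds some load of its own, so a longer interval may be kinder to a server that is already struggling.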

#23 Mind

Topic Starter
Life Member, Director, Moderator, Treasurer
19,822 posts
2,000 ₮

Location:Wausau, WI
✔

Posted 23 February 2009 - 06:49 PM

Lightowl travels for work and at times might be away for a week or so at a time, with not much access or time to address Imminst issues. I have alerted him to this thread and I see we have a handful of people familiar with the operation of MySQL and server/website stuff. I am sure we can set up a team soon to get these ideas tested out.

Thanks for all the contributions thus far!!

#24 lucid

Guest
1,195 posts
65 ₮

Location:Austin, Tx

Posted 23 February 2009 - 08:42 PM

Yeah, I have been having a rough time browsing the past couple of months.

Mind, if you are wondering how to speed it up, here are my thoughts:

1. Web Hosting Company - If the site is being hosted by a third-party web hosting company (which I would imagine is the case), then you will likely want to talk to one of their service representatives. They may have limits on:
*The number of simultaneous connections.
*The bandwidth you receive.
*They may have priority ratings for CPU cycles and bandwidth for people paying different premiums.
2. ISP - If Imminst owns the actual computer that serves the website, then you will probably need to talk to the ISP. With them, we are primarily interested in seeing what % of our upload and what % of our download bandwidth we are using.
3. If it is our own server, we want to see if the bottleneck is internal. If so, it could be that 45% is a high CPU usage if we are running a single-threaded hosting app on a dual-core computer. One of the biggest issues for forum servers is information storage and access. Usually the bottleneck is going to occur at the hard drives, because forums make many small requests on the server as opposed to a few big ones. This means that the hard drive head is going to be very busy moving from one sector to another (this is referred to as seek time). There are a couple of ways to get around this. The first and most common way is to use RAID'ed hard drives; in fact, RAID'ed SCSI hard drives are the best for servers, but they are more expensive. And the very fastest hard drives for quick access are flash (no moving parts) hard drives. I would be willing to help some along with David.
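To show why seek time matters so much for many small requests, here is a back-of-the-envelope Python sketch. It assumes each random request costs roughly one average seek plus half a disk rotation and ignores transfer time and caching. The seek times and spindle speeds below are illustrative figures, not the specs of whatever drive the site actually runs on.

```python
def random_iops(avg_seek_ms, rpm):
    """Rough random-I/O ceiling for one spinning disk, in requests per second.

    Each random request costs about one average seek plus half a rotation.
    """
    rotational_latency_ms = 0.5 * 60000.0 / rpm
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)


# Illustrative numbers only.
print("desktop-class 7200 rpm:  %.0f IOPS" % random_iops(8.5, 7200))
print("server-class 15000 rpm: %.0f IOPS" % random_iops(3.5, 15000))
```

Either way a single spinning disk manages on the order of a hundred or two random reads per second, so a few concurrent requests that each need hundreds of scattered reads can keep it busy for seconds. Striping across several drives raises that ceiling roughly in proportion to the number of drives.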

Bottom line, we need to figure out where the bottleneck is occurring. If it is our hosting company, we need to get a better plan with them or change. If the problem is on our end, then we need to figure out whether it is our server or our ISP. If it is our server, then that is the area where I would be able to help most. The other problems are probably going to be resolved on the phone.

#25 Mind

Topic Starter
Life Member, Director, Moderator, Treasurer
19,822 posts
2,000 ₮

Location:Wausau, WI
✔

Posted 23 February 2009 - 09:17 PM

Third party ISP = Canaca. They don't get real high user rankings but have been very helpful for the Institute. Changing providers is highly unlikely because we have a close relationship with Canaca support and we would not be able to find anything cheaper.

Davidd. Thanks for offering to help. I think we could get a team together, hopefully with Lightowl as leader and diagnose the problems. A couple other members have also volunteered to help.

P.S. merging this with the other software/tech thread.

#26 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 24 February 2009 - 02:35 AM

One thing we need is data. We need to know when / how often the issues are occurring. One thing we could do is write a client side script that simply does an HTTP GET/POST that loads a page that does a message board database query every 3 seconds, tracks the millisecond response time and logs it to a file. Let that run for a couple of weeks and then see if there are any patterns. From there, troubleshooting gets much easier, as does working with Canaca support.

That's a reasonable idea. I threw something together and it is monitoring the performance now. I just have it hitting the Forum page. That page has to do a number of queries against the DB and it is one of the pages I've noticed as being the slowest to load. I've already seen up to nearly 20 seconds delay in the past several minutes that my monitor script has been running. The typical response time is under 1 second, but tonight seems to be a good night for slowness.

Earlier today it was typically 1/2 a second, with some blips. There are more blips tonight.

I'll try to put together some graphs and post in the next couple days.

David

#27 niner

Guest
16,276 posts
2,000 ₮

Location:Philadelphia

Posted 24 February 2009 - 03:18 AM

Has anyone contacted the IPB guys about this? They might have some useful insights. My guess is that the problem is on the Canaca side. What do they have to say about it?

#28 lucid

Guest
1,195 posts
65 ₮

Location:Austin, Tx

Posted 24 February 2009 - 05:24 AM

Has anyone contacted the IPB guys about this? They might have some useful insights. My guess is that the problem is on the Canaca side. What do they have to say about it?

If we are hosted with a 3rd party web host as Mind said, then the problem basically must be on their side (because the only part on our end is our website code). They should have their own people who can figure out where the bottleneck is occurring; we just need to let them know they need to figure out what the problem is. (My guess is that it's going to be server hard drives.) It is kind of funny, with all of the techno-prophecy we hear here @ imminst, that our website is so slow that it barely loads sometimes; hopefully we won't have to shell out more $$ and they can swap us to a stronger server.

#29 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 24 February 2009 - 05:59 AM

Has anyone contacted the IPB guys about this? They might have some useful insights. My guess is that the problem is on the Canaca side. What do they have to say about it?

I've surfed various forums on IPB issues, but I don't know who has access to the ImmInst account to post proper tickets/issues. I'm guessing it would need to be Lightowl, as they would most likely ask for a lot of stuff that only he'd have access to.

In the little bit of time I've spent on this today, I've found that certain pieces of functionality on the boards are killing performance. I had a hunch on a few things and ran some tests. The biggest offender is the member profile functionality. Going to a member's profile or clicking on the various tabs within a profile (Gallery, My Videos, etc.) causes the system to hang for an average of 10 seconds. In other words, if I click on one of these tabs in my session, I'm going to hang your session (and everyone else, most likely) for about 10 seconds. If you get multiple people doing this at the same time, the effects can (sometimes) be additive (20 seconds, 30 seconds, etc.).
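The additive effect is what you would expect if those requests are handled strictly one at a time. A toy Python model, assuming full serialization and a fixed 10 second cost per profile request (both are assumptions drawn from what I observed from the outside, not confirmed behaviour of IP.Board):

```python
def serialized_waits(arrival_times_s, service_time_s):
    """Delay seen by each request when requests are processed one at a time.

    arrival_times_s must be sorted; each request waits for all earlier ones.
    """
    delays = []
    free_at = 0.0  # time at which the server finishes its current request
    for arrival in arrival_times_s:
        start = max(arrival, free_at)
        free_at = start + service_time_s
        delays.append(free_at - arrival)
    return delays


# Three people click a profile tab at the same moment.
print(serialized_waits([0.0, 0.0, 0.0], 10.0))  # [10.0, 20.0, 30.0]
```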

After seeing this, I'm surprised we are able to function as well as we are. If we had more users and if people used the profiles more, we'd be dead in the water. Our saving grace is that most user time is probably spent reading/writing for relatively long periods of time in the forums, rather than a) clicking around in profiles and b) jumping around in the forums.

1) I'm putting more money down on disk related issues and less on network, from what I've seen in my testing. In this school of thought, I'm going to weight inefficient queries (including possibly an inefficient data model and/or indexing) and also the possibility of table locks. I'm hoping it is table locks, because those might be easier to cure than the queries by just changing some settings or trying another MySQL engine (I still have no clue what we are running).

The other 2 thoughts I have running around in my head are:

2) Some sort of setting (hopefully) or bad coding (in IPB's PHP) that is serializing our requests. I mean, there is definitely some level of serialization happening, but I can't tell whether it is being imposed by resource constraints (the disk stuff I mentioned above in (1)) or through voluntary mechanisms in the code/configuration.

3) The database buffer may be undersized. If this is our culprit, then what is happening is that certain requests are pulling data off disk and putting it into memory, flushing out what is already stored there. When a user tries to access a page (like the main forum page), the DB has to go back to disk to get it, but it is fighting with (or waiting for) the first request to fill the buffer so the data can be delivered to the application server/browser. After the data has been delivered, then the second request gets its turn to put data back into the buffer and that data can be delivered to the application server/browser.

I'm heavily weighting (1), but it is pretty hard to tell, remotely, whether it is (1) or (2). I would need access to the database to narrow it down further.
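To illustrate what (3) would look like, here is a toy Python simulation of a least-recently-used buffer. The page numbers and buffer sizes are invented, and a real database buffer pool is far more sophisticated, so treat this only as a picture of the thrashing effect.

```python
from collections import OrderedDict


def buffer_hit_rate(page_requests, buffer_pages):
    """Fraction of requests served from an LRU buffer of the given size."""
    buffer = OrderedDict()
    hits = 0
    for page in page_requests:
        if page in buffer:
            hits += 1
            buffer.move_to_end(page)        # mark as most recently used
        else:
            buffer[page] = True             # simulated read from disk
            if len(buffer) > buffer_pages:
                buffer.popitem(last=False)  # evict the least recently used page
    return hits / len(page_requests)


# Hypothetical workload: the forum index touches pages 0-3,
# a profile view touches pages 100-107, and the two alternate.
forum = [0, 1, 2, 3]
profile = list(range(100, 108))
workload = (forum + profile) * 50

print(buffer_hit_rate(workload, 8))   # 0.0
print(buffer_hit_rate(workload, 16))  # 0.98
```

With a buffer smaller than the combined working set, each kind of request keeps evicting the other's pages and everything goes back to disk. Once the buffer holds both, nearly every request is served from memory.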

David

#30 davidd

Guest, F@H
328 posts
1 ₮

Location:Minnesota

Posted 24 February 2009 - 05:22 PM

I did some analysis of the data that I captured over the night. There were a few blips. There was a 3 minute period, starting at about 1:43:30 am (CST) (A) and another 3 minute period starting at about 5:27:30 am (CST) (B) and a 1 minute period at about 4:20:30 am (CST) (C) that affected forum performance.

(A) didn't affect general website performance (non-forum). (B) did. (C) did not.

There was a 30 second period at around 7:29 am that affected non-forum performance and had a very slight impact on forum performance.

There was a 3 minute period from about 3:04 am to 3:07 am that affected non-forum performance and had a slight impact on forum performance.

Having said all that, the above are not our issue. They are few and far between and are typical of nighttime batch jobs (database backups, OS level backups, other batch jobs that may affect network, CPU, etc.).

(A) and (B) were most likely something acting on the database or part of the storage infrastructure that holds the DB files. If I had to guess, I'd say (A) might have been an OS level backup, which included database files and general website files, and (B) might have been just a DB backup. (C) was too short to comment on. The other small blips that affected the non-forum stuff and had a slight impact on forum performance could have been disk *or* network, whereas (A) and (B) were most likely disk related. Those smaller blips were definitely a different flavor of impact than (A) and (B).
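For anyone wanting to repeat this kind of comparison, here is a minimal sketch of how slow periods could be pulled out of two latency logs and checked for overlap. The sample data, the 3 second threshold, and the 30 second probe interval are all hypothetical and are not the actual capture.

```python
def slow_windows(samples, threshold, max_gap):
    """Group slow samples into windows.

    samples: list of (timestamp_seconds, latency_seconds), sorted by time.
    Returns a list of (start, end) timestamps. Consecutive slow samples no
    more than max_gap seconds apart are merged into one window.
    """
    windows = []
    for ts, latency in samples:
        if latency < threshold:
            continue
        if windows and ts - windows[-1][1] <= max_gap:
            windows[-1] = (windows[-1][0], ts)  # extend the current window
        else:
            windows.append((ts, ts))  # start a new window
    return windows


def overlaps(window, others):
    """True if window intersects any window in others."""
    start, end = window
    return any(start <= o_end and o_start <= end for o_start, o_end in others)


# Hypothetical probe data: one sample every 30 s, latency in seconds.
forum = [(0, 0.4), (30, 6.0), (60, 7.5), (90, 0.5), (300, 8.0), (330, 0.4)]
site = [(0, 0.2), (30, 0.2), (60, 0.3), (90, 0.2), (300, 5.0), (330, 0.2)]

forum_blips = slow_windows(forum, threshold=3.0, max_gap=30)
site_blips = slow_windows(site, threshold=3.0, max_gap=30)

# Forum blips that also show up on the non-forum pages.
shared = [w for w in forum_blips if overlaps(w, site_blips)]
```

A blip that appears only in the forum log suggests something specific to the forum's database, while a blip in both logs suggests something broader such as disk or network.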

Anyway, like I said, the above are not our issue. The issue is that we have a forum system that performs poorly in certain pieces of functionality, and that functionality also impacts other parts of the forum system when it is used. There is a very strong tendency toward 10 second delays or multiples of 10 seconds (20 seconds, 30 seconds, etc.) when this functionality is used. In other words, this functionality takes 10 seconds and it blocks other requests for 10 seconds.
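One way to put a number on the "multiples of 10 seconds" pattern is to count how many measured response times fall close to 10, 20, 30 seconds and so on. The sketch below assumes a half second tolerance and uses invented timings, so treat it as an illustration of the check rather than the measurement itself.

```python
def near_multiple(latency, period=10.0, tolerance=0.5):
    """Return k if latency is within tolerance of k * period (k >= 1), else None."""
    k = round(latency / period)
    if k >= 1 and abs(latency - k * period) <= tolerance:
        return k
    return None


def fraction_quantized(latencies, period=10.0, tolerance=0.5):
    """Fraction of latencies that sit near a positive multiple of period."""
    if not latencies:
        return 0.0
    hits = [x for x in latencies if near_multiple(x, period, tolerance) is not None]
    return len(hits) / len(latencies)


# Hypothetical response times in seconds, for illustration only.
observed = [0.6, 10.2, 0.8, 20.1, 9.9, 30.3, 4.7, 10.0]
share = fraction_quantized(observed)
```

If the delays came from ordinary resource contention, the slow responses would be expected to spread out fairly evenly rather than cluster near multiples of a fixed period. A high share among the slow responses is what makes a timeout or polling interval look likely.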

Initially, I was willing to believe that it was coincidence and that it just took 10 seconds for the queries to run. Now I'm thinking it is either some sort of polling or a timeout situation. I'd almost be willing to believe that it could be something in the IP Board application layer. The user profile functionality *acts* differently than other parts of the product. You'll notice that a box pops up with the text "Loading Content..." on it, with an animation of a circle with arrows rotating around the outside.

This tells me that the IPB people either (a) knew that there were these delays, so they put up the box to let people know that the application was still functioning, or (b) they knew that profile related page builds could be slow (due to the multimedia content -- images, videos, etc.), so they implemented the box to let people know that the application was still functioning. If (a), then we need to contact the cryonics companies to revoke any possible IPB developer policies. Heh, I'm only kidding........sort of.

If (b), then it is possible that the box mechanism itself may have accidentally *introduced* the fixed latency we are seeing. They could have even *purposefully* throttled back the performance of the profile functionality, with the intention of keeping that type of activity from affecting other parts of the application, since, again, the profile functionality is generally higher I/O than text retrieval from the forums. If so, then it could be that they accidentally introduced a blocking mechanism, thus doing the opposite of their original intent. *Or*, it could be that we have our pieces of the system (web server, app server, db server) configured in such a way that makes their box functionality block other sessions.

Current money is on IPB layer, programmed latency. Either that, or a combination of that and database table locks. For example, while the session accessing the profile functionality is waiting, it could have a lock on a database table and that lock could be what is indirectly delaying the other forum queries, not the application level profile functionality itself, directly. I'm switching over to the IPB layer due to the consistency of the delays. I find it more likely that there would be a 10 second programmed delay in the web/application tiers than the DB tier. If we were seeing variable performance delays, then I'd lean more toward a resource scaling/contention issue.
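The indirect delay described here can be illustrated with a tiny simulation. It assumes a single exclusive lock granted in arrival order and a profile request that holds it for 10 seconds. Both are simplifications for the sake of the example, not a description of how IPB or MySQL actually behave.

```python
def simulate_exclusive_lock(requests):
    """Simulate requests that each need the same exclusive lock.

    requests: list of (name, arrival_time, hold_time), sorted by arrival.
    The lock is granted first come, first served.
    Returns {name: (wait_time, finish_time)}.
    """
    results = {}
    lock_free_at = 0.0
    for name, arrival, hold in requests:
        start = max(arrival, lock_free_at)  # wait until the lock is released
        finish = start + hold
        results[name] = (start - arrival, finish)
        lock_free_at = finish
    return results


# Hypothetical workload: a profile request holds the lock for 10 s, and
# two ordinary forum queries needing only 0.1 s each arrive just after it.
timeline = simulate_exclusive_lock([
    ("profile", 0.0, 10.0),
    ("forum_a", 1.0, 0.1),
    ("forum_b", 2.0, 0.1),
])
```

In this toy model the two forum queries wait roughly 9 and 8 seconds even though each needs only a tenth of a second of work, which is the same shape as the delays seen on the forum pages.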

The other thing I did was hunt around for other sites that use IPB. I found a few that did have profile functionality and were open to the public (no login required) and those sites had good performance (less than a second per tab click in the user profile). Some of them did have the box that popped up too. I mention this, because at least there is hope. They may have been running on a completely different operating system or using a completely different web/app/DB server, but at least it is possible for that functionality to perform well.