Firebird News

Monday, March 06, 2006

Vulcan threading - needs to be fixed


INTRODUCTION

The point of this email is to call attention to what I think is the most critical outstanding element of the Vulcan project - the reliability of the threading and sharing model. This email is, I'm sure, going to devolve in unexpected ways; I'd ask that you read it carefully and think before replying, but please do reply... I believe the "Vulcan" part of the Firebird community very much needs the input of the rest of you all.

Vulcan currently has two conditionally defined threading models. More accurately, two models describing very different levels of sharing of data structures. They have very different qualities, and I believe neither one (in its current form, at least) is what we ultimately must have.



SHARED_CACHE

If the symbol SHARED_CACHE is defined (currently the default) then Jim's original Vulcan model of fine-grained threading and data sharing is built. This model generally works on the assumption that most things should be shared if possible, and relatively light-weight locking mechanisms are employed to provide serialization/exclusivity wherever a shared data structure must be made thread-safe.

Key benefits of this model include a more efficient memory footprint, since there are very few redundant copies of data structures in memory. This comes at the cost of increased use of synchronization primitives; a read that does not actually conflict with another thread must still act as if a conflict is possible or imminent, so most data access paths pay a performance cost to protect the code from potential collisions with other threads.
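
A minimal sketch of that cost, in modern C++ rather than anything taken from Vulcan's sources (the class and names here are purely illustrative): every lookup on a shared structure must acquire a lock, whether or not any writer is actually active.

    #include <map>
    #include <mutex>
    #include <shared_mutex>
    #include <string>

    class SharedCache {
    public:
        // Every lookup pays for lock acquisition, contended or not.
        int find(const std::string& key) {
            std::shared_lock<std::shared_mutex> guard(mutex_);  // read lock
            auto it = pages_.find(key);
            return it == pages_.end() ? -1 : it->second;
        }

        void store(const std::string& key, int page) {
            std::unique_lock<std::shared_mutex> guard(mutex_);  // write lock
            pages_[key] = page;
        }

    private:
        std::shared_mutex mutex_;           // guards the shared structure
        std::map<std::string, int> pages_;
    };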



(NO) SHARED_CACHE

If the SHARED_CACHE symbol is not defined, then a much coarser level of sharing is implemented, more akin to the Classic model - each thread has its own page cache, so data structures resolved from the page cache are usually thread-private. When coherency between copies of the cache must be maintained, the same locking mechanisms as in SHARED_CACHE are used to invalidate pages and force reloads.

Key benefits include much faster read performance, since far less data is shared. The cost is that writes have a disproportionately negative impact on performance, since updated pages must be invalidated in every connection's copy of the page cache. Worse, the memory footprint increases linearly with the number of connections, so the per-connection page cache must be kept relatively small, reducing the effectiveness of the cache.
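
To make the write cost concrete, here is a hypothetical sketch (none of these types exist in Vulcan; they only illustrate the shape of the model): each connection holds a private copy of its pages, and a writer must mark a changed page stale in every other copy.

    #include <mutex>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    struct Connection {
        std::unordered_map<int, std::vector<char>> pageCache;  // private copy
        std::unordered_set<int> stale;         // pages that must be reloaded
    };

    class CoherencyManager {
    public:
        void registerConnection(Connection* c) {
            std::lock_guard<std::mutex> g(mutex_);
            connections_.push_back(c);
        }

        // Called after a page is written: every private cache must drop it.
        void invalidate(int pageNo) {
            std::lock_guard<std::mutex> g(mutex_);
            for (Connection* c : connections_)
                c->stale.insert(pageNo);    // forces a reload on next access
        }

    private:
        std::mutex mutex_;
        std::vector<Connection*> connections_;
    };

The loop in invalidate() is the linear write cost described above, and the per-connection pageCache is the linear memory cost.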


THE BIG PROBLEM

The point of all this is that the biggest problem with the SHARED_CACHE model is that it does not work reliably under load. We can argue the merits of the test cases for a while, but putting Vulcan under significant load on a real multiprocessor system causes it to fail, and fail relatively quickly. Sometimes the failure is caused by undiscovered critical sections that need additional locking, but sometimes the failure is a deadlock because the relationship of critical sections wasn't completely predicted when the locks were added.
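
The deadlock case is worth spelling out, because it is the harder one to fix after the fact. A minimal illustration (not taken from Vulcan's sources): two threads take the same pair of locks in opposite order, and under load each eventually waits on the other forever.

    #include <mutex>
    #include <thread>

    std::mutex cacheLock, metadataLock;

    void threadA() {
        std::lock_guard<std::mutex> first(cacheLock);      // A holds cacheLock...
        std::lock_guard<std::mutex> second(metadataLock);  // ...and wants metadataLock
    }

    void threadB() {
        std::lock_guard<std::mutex> first(metadataLock);   // B holds metadataLock...
        std::lock_guard<std::mutex> second(cacheLock);     // ...and wants cacheLock
    }

    int main() {
        // If A and B interleave at the wrong moment, neither can proceed.
        // The usual cure is a single global acquisition order, or taking
        // both locks atomically with std::lock(cacheLock, metadataLock).
        std::thread a(threadA), b(threadB);
        a.join();
        b.join();
        return 0;
    }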

Based on my experience with the embedded usage scenario for Vulcan, I do not believe that SHARED_CACHE works reliably enough to actually use in its current form. Period.



BREAKING IT DOWN

There are quite a few issues that come from this. Here are many of my thoughts, in no specific order. When I say "we" in the following paragraphs, I mean the entire community of Firebird and Vulcan developers, not just my particular company.


1. The (NO)SHARED_CACHE model is not the long-term future, and only works in server environments where very large amounts of memory can be dedicated to the server and/or connection throttling and pooling mechanisms attempt to mitigate the scalability problem. It is a stop-gap that we introduced because we needed Vulcan to work in a timeframe that matches some of our product delivery dates.

2. The SHARED_CACHE model does not work, and is very, very hard to debug. Clearly debugging by careful inspection can go a long way, but that presumes the inspector(s) know what they are looking for and/or clearly understand the intended sharing model. It also requires a lot of time and effort. We at SAS have thus far been unable to make this work ourselves, in part because the underlying lock manager appears to deadlock for reasons we have not been able to solve.

3. Testing multi-threaded code is difficult, and nearly impossible to do without the right hardware. I'll be the first to acknowledge that this is a significant problem in an open source community where many of the participants are either self-funded or trying to balance business objectives against community involvement and may not have the luxury of dedicating significant hardware to this. We need to find a practical way to resolve the test environment issues so that the core development team (at least) can regularly validate changes.

4. We need an agreed-upon test suite to validate the threading and sharing model. SAS uses a set of tests ("threadtest") that can be configured to vary the number of simultaneous client connections and the workload (a hypothetical sketch of such a driver follows this list). I think we've shared this already, but if not we'll be glad to make it available as a candidate test tool. More importantly, the community needs to agree on a minimum test scenario that must pass before Vulcan can be considered ready for real users.

5. We cannot accept a system that works "for a while" or "for a few users" and then dies in a horrible, hard-to-reproduce way that puts data at risk. It's fine for a server to have limits, but those limits should be something that can be determined in advance or predicted at runtime, so that a thoughtful and reasonable response can be given to clients and tools. Declaring a test scenario unreasonable because it is hard to debug is not acceptable. Declaring a test case unreasonable and yet still guaranteeing that the server responds reasonably to it is great.

6. There are volumes written about the difficulty of retro-fitting threading on code never written to support it. This is a very hard thing to do at all, and very very very hard to do well. My point is not to be a naysayer about Vulcan, but to make clear that we must understand that threading the Firebird code base is not simple to implement, test, or debug.

7. This is important because planning for the Vulcan/FB integration needs to take into account a realistic view of the state of Vulcan. If we think that Vulcan is nearly done as a 1.0 release (the timeline published during the conference called for a public release around now, if I recall) then a merger really becomes about the divergence in the code bases and reconciling them. If Vulcan really can't perform as it must to be used as the fundamental engine architecture of FB3, then the FB3 merger is about far more than resolving class name changes and updating the optimizer - it is about the guts of Vulcan and whether the sharing model really understands what is shared, when, and why.

8. Related to this has to be some kind of protocol for testing before a push. Threading issues are so hard to debug that they become exponentially harder to track down and repair the further they become buried in the CVS history records. If code is changed that could reasonably be expected to influence the thread-safety of the project, it must be tested against the approved benchmarks regularly, and ideally before commit for large changes. It is bad enough when single-threaded code is pushed without adequate validation, but it can be the death of a code base when those changes affect concurrency. You can take this any way you like, but too much of Vulcan has already been written without adequate testing. We are compounding the difficulty of retrofitting threading on a monolith by attempting to retrofit quality on the threading. Much of this is due to the issues above involving availability of test environments and agreement on the requirements, but it is a problem that will get worse - not better - as more folks get involved in modifying Vulcan.
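
As promised in item 4, here is a hypothetical sketch of what a "threadtest"-style driver could look like. The real tool's interface is not shown here, so every name below is an assumption; the workload callback would open a connection and run the agreed read/write mix against the server under test.

    #include <atomic>
    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>

    // Run `clients` concurrent workers, each executing `iterations` units
    // of work; any exception thrown by the workload counts as a failure.
    int runStressTest(int clients, int iterations,
                      const std::function<void(int)>& workload) {
        std::atomic<int> failures(0);
        std::vector<std::thread> workers;
        for (int c = 0; c < clients; ++c) {
            workers.emplace_back([&workload, &failures, iterations, c] {
                for (int i = 0; i < iterations; ++i) {
                    try {
                        workload(c);
                    } catch (...) {
                        ++failures;   // failed operation, but no crash
                    }
                }
            });
        }
        for (std::thread& t : workers) t.join();
        std::printf("%d clients x %d iterations: %d failures\n",
                    clients, iterations, failures.load());
        return failures.load();
    }

    int main() {
        // Dummy workload; a real run would attach to a test database here.
        return runStressTest(8, 1000, [](int /*client*/) {}) == 0 ? 0 : 1;
    }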


WHAT DO WE NEED TO DO NEXT?

I don't want to start a flame war, and I sure don't want to retard forward progress on Vulcan. But I work with more than a few smart people, and our team couldn't make SHARED_CACHE work reliably after trying for quite a while, using the best tools and systems we had available. It is my sincere hope that this was largely because we at SAS are still learning the Firebird/Vulcan architecture, and if we had the kind of deep knowledge that many of you have, it would be a workable problem.

My point is that I think this is a big deal, and needs to be tackled before there is too much planning on a FB3 merger that will take away energy from the "does it really work" question. And that question needs independent validation, from parties other than either Jim or me.

The community needs a plan to tackle this, and some resources to do it.

1. A test scenario needs to be developed. There needs to be an assessment of a reasonable test server configuration, in terms of number of processors, CPU power, memory footprint, etc. And we need to understand the test case to be thrown at that server: number of clients, think time versus work time, read/write ratios, degree of sharing of indexes, tables, and databases, etc. (A sketch of the parameters involved follows this list.)

2. There needs to be a definition of a "successful" test case. This should include what must happen during the test run, and what must not happen during the test run.

3. There needs to be a thorough review of the test software itself. We don't want the test driver to contain implicit assumptions or limits that skew the test towards any particular outcome. The test should not be what just one company wants or needs, but should represent the community interest.

4. There needs to be a test configuration provided for an independent assessment, with an agreed upon build of the system and a validation that it meets the requirements set out in item #1 above.

5. The tests need to be published and the community needs to openly discuss what they mean and what influence they will have on FB3 planning and execution.
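
To make items 1 and 2 concrete, this is roughly the shape of what would have to be pinned down. None of these structures exist anywhere; they are purely illustrative of the decisions the community would need to make.

    // Hypothetical description of an agreed test scenario (item 1).
    struct TestScenario {
        int processors;     // minimum CPUs on the test server
        int memoryMB;       // memory available to the server
        int clients;        // simultaneous connections
        int thinkTimeMs;    // idle time between operations per client
        int workTimeMs;     // expected busy time per operation
        double writeRatio;  // fraction of operations that modify data
        int sharedTables;   // degree of table/index/database sharing
        int runMinutes;     // how long the load must be sustained
    };

    // Hypothetical definition of a "successful" run (item 2).
    struct SuccessCriteria {
        bool noCrashes;     // server must survive the full run
        bool noDeadlocks;   // no hung connections at any point
        bool noCorruption;  // database validates cleanly afterwards
        int  maxErrorRate;  // tolerated failed operations per million
    };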



CONCLUSION

At the end of the day, SAS is just one of the members of this community - and a fledgling one at that. But we have some resources that we can use to help solve these problems, even though we do not have the full knowledge of Firebird internals.

More importantly, we think that the promise of Vulcan is valuable and good, but the current implementation is not getting the attention it needs to fulfill that promise. We can help, but only so far as our knowledge takes us. I believe that the community needs to pick up the issue of Vulcan's scalability and threading and make it the first priority of post-FB2 work. And we need to start with a dispassionate evaluation of what we have in Vulcan today.

I wrestled with whether this was a Firebird developer or architecture issue, but I have settled on the developer list - we have real problems with the real code already implemented, and need a plan of attack on how to fix it that is larger than assigning it to a core developer.

I look to the community for input on what to do next.


--
Tom Cole, Platform R&D, SAS Institute Inc.
