
Friday, May 17, 2013

Demystifying WebLogic and Fusion Middleware Management
by Glen Hawkins, Senior Director, Product Management

This week, we are going to switch gears and talk about something that is near and dear to everyone responsible for running applications and middleware in their environment: monitoring and management, with specific emphasis on the Oracle Enterprise Manager Cloud Control solution.
Often, this topic is dismissed early in the architectural discussion and doesn’t rear its (sometimes ugly) head until development is fairly far along on a new application and the team is planning its deployment, or worse, until problems in production begin to impact the overall service levels of the application to the point that end users are complaining or top-line revenue is being lost to poor performance or reliability problems. The result is that the inexperienced treat monitoring and management of the middle tier, and of the application system as a whole, as an afterthought, while those who are more experienced or forward looking tackle it from day one.
So, let’s start with some common pitfalls, or myths, that people run into when considering or planning their management deployment, along with a discussion of each point:

The first myth is that the administration consoles that come with the products are all you need for monitoring and management. I think that most who have attempted this in the past have learned the error of their ways. Tools such as the WebLogic Administration Console are designed to get the product up and running and to handle general configuration and administration of a single domain. They are not intended as a solution for monitoring and managing many domains (possibly even multiple versions of those domains) as well as the entire application infrastructure (databases, hosts, message queues, service buses, and so on) at once. And they routinely do not provide any historical metrics or real 24/7 diagnostics. No administrator wants to be in a situation where a problem occurred an hour ago and they no longer have any information on it because they only have real-time data. You need both real-time and historical monitoring and diagnostics capabilities.
In addition, administrators routinely want to be able to answer the usual question that comes up when everything was running fine one day and fails to perform the next: “what has changed?” To answer that question, you need historical information at every tier of the application, including the host, as well as visibility across the stack into both monitoring and configuration data.
Possible answers include an increase in end users, a change in how end users are using the application (that marketing event you didn’t know about), application changes, WebLogic domain changes, JVM changes, a patch that was applied, or even someone starting to run something new on the machine or otherwise impacting the OS.
Correlating these changes and coming to a quick conclusion is key to ensuring optimal application service levels for your end users in production. That means you need a full-stack, 24/7, real-time and historical monitoring solution that can also provide meaningful diagnostics and track and compare configuration standards across the entire application system stack, which, in the case of the Oracle stack, is something that only Oracle Enterprise Manager Cloud Control is able to provide.
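To make the “historical data” point concrete, here is a minimal sketch of the kind of periodic metric collection that Enterprise Manager automates for you across the whole stack. It is written as a WLST (Jython) script; the admin URL, credentials, server name, and output path are placeholder assumptions, and in practice you would rely on Cloud Control’s own collections rather than rolling your own.

    # Minimal WLST (Jython) sketch: sample WebLogic JVM heap metrics once a minute
    # and append them to a CSV file so there is history to look back on.
    # The admin URL, credentials, server name, and file path are placeholders.
    import time

    connect('weblogic', 'welcome1', 't3://adminhost:7001')   # WLST built-in
    domainRuntime()                                          # switch to the runtime MBean tree

    out = open('/tmp/heap_history.csv', 'a')
    for i in range(60):                                      # one sample per minute for an hour
        cd('/ServerRuntimes/ManagedServer_1/JVMRuntime/ManagedServer_1')
        out.write('%d,%s,%s\n' % (time.time(), get('HeapFreeCurrent'), get('HeapSizeCurrent')))
        out.flush()
        time.sleep(60)
    out.close()
    disconnect()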
The next myth is that each team responsible for a different tier can get by with its own separate tools. This one is quite simple at the end of the day, especially for anyone who has been pulled into a war room over a production application emergency, with all the finger pointing and frustration that routinely ensues. The various team members responsible for the different portions of an application system almost always need to collaborate to resolve problems, and with separate tools that collaboration can be slow and frustrating.
A single pane of glass, with roles and privileges governing who can see what, allows everyone to speak the same language. At the end of the day, when a fire drill arises, communication and collaboration are what pull you through, and both are greatly enhanced by the right solution.
Oracle’s Enterprise Manager Cloud Control solution was designed to promote this level of communication between roles. Flexible dashboards provide different views of the application to different team members, and the diagnostics go well beyond just isolating SQL calls: bi-directional navigation between JVM threads and Oracle database sessions, and the Middleware Diagnostics Advisor, which provides recommendations and diagnostic findings for the WebLogic stack. All of this quickly cuts down your time to resolution, as opposed to raw metrics that force you to piece together fragments of the story from completely separate tools.
Another myth, and one that tends to surprise those who are new to application and middle-tier management, is that problems can simply be reproduced and diagnosed outside of production. In development environments, particularly during the QA and load testing phases, the environments are usually well controlled and, because they are not in production, you can more easily reproduce errors and attempt to resolve them there. In production environments, however, it becomes extremely difficult to reproduce issues, as the load, network, application environment, and overall intermittent behavior of all of the tiers can challenge even the most technical operations person, including those who developed the application in the first place.
We routinely see issues reported by end users in production environments where monitoring is minimal. Often, hours, days, even weeks are spent trying to reproduce issues, or waiting for them to happen again if they are intermittent, when no historical monitoring and diagnostics are available in the environment. The bottom line is that you need to be able to diagnose problems in the production environment itself.
Within Enterprise Manager Cloud Control, both historical and real-time metrics are available 24/7 across all tiers, and they are correlated together. Let me provide a quick, simple example of a possible root cause analysis scenario where an application is degrading in performance over time. Memory analysis tools by themselves are not able to pinpoint the problem, but it is clear that there is a buildup of referenced objects on the heap (possibly falling under the high-level classification of a “memory leak” type of issue, though there are other possible causes). The time-honored workaround might be to restart servers on a regular basis, trying to maintain high availability as you do, but that will not get you closer to finding the real issue; it is a band-aid that may very well fail when and if capacity increases for your application.
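As an aside, if you suspect this kind of slow buildup of live objects and want to confirm it outside of any console, one low-tech approach is to compare two class histograms taken some time apart and look for classes whose instance counts only ever grow. The sketch below is illustrative only: it assumes a JDK with jcmd on the PATH, a known PID for the managed server, and plain Python 3 on the host.

    # Illustrative Python 3 sketch: diff two jcmd class histograms to spot classes
    # whose instance counts keep growing. The PID and the sleep interval are
    # placeholders; run it on the host while the application is under load.
    import re
    import subprocess
    import time

    def class_histogram(pid):
        out = subprocess.check_output(['jcmd', str(pid), 'GC.class_histogram'], text=True)
        counts = {}
        for line in out.splitlines():
            m = re.match(r'\s*\d+:\s+(\d+)\s+\d+\s+(\S+)', line)
            if m:
                counts[m.group(2)] = int(m.group(1))   # class name -> instance count
        return counts

    pid = 12345                    # placeholder: the managed server's JVM process id
    before = class_histogram(pid)
    time.sleep(600)                # let the suspected leak accumulate for ten minutes
    after = class_histogram(pid)

    growth = sorted(((after.get(k, 0) - n, k) for k, n in before.items()), reverse=True)
    for delta, cls in growth[:20]:
        print(delta, cls)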
Let’s say we start with a notification from Enterprise Manager Cloud Control that a critical alert has occurred on the Work Manager – Pending Requests metric, indicating that there is a buildup of requests in the application. This is an early indicator, and a Request Processing Time alert is likely to follow if the trend continues, so let’s jump in and diagnose the problem.
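For reference, the same request backlog can be spot-checked by hand with WLST against the server’s thread pool runtime MBean. This is only a hedged sketch, with placeholder credentials, URL, and server name, and it gives you a single point-in-time value rather than the alerting and history that Cloud Control provides.

    # Hedged WLST (Jython) sketch: read request backlog metrics from the
    # ThreadPoolRuntime MBean of a managed server. All connection details and
    # the server name are placeholders.
    connect('weblogic', 'welcome1', 't3://adminhost:7001')
    domainRuntime()

    cd('/ServerRuntimes/ManagedServer_1/ThreadPoolRuntime/ThreadPoolRuntime')
    print('Pending user requests: %s' % get('PendingUserRequestCount'))
    print('Hogging threads:       %s' % get('HoggingThreadCount'))
    print('Queue length:          %s' % get('QueueLength'))
    disconnect()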
First, let’s look at one of the higher level customizable dashboards in the product to see the lay of the land:
We can see from our WebLogic application above (just a simple MedRec example in this case) that all of our servers look like they are up and running, and some of our heap and other metrics look high but not unreasonable, with the exception of some of our JVMs, which show DB Wait locks in red in the bottom-right table. This is a sure indicator that the pending requests we were alerted to earlier are likely associated with calls of some kind to the back-end database. If I click on the JVM in question, I can take this down a level.
Now we are on our JVM target home page within our WebLogic Domain hierarchy (there are many more metrics and capabilities there than we can cover in this blog, but I will provide links below). Here we can see a bit more detail and filter to our heart’s delight by clicking on the various hourglass icons to search on methods, requests, SQL, thread state, ECID (a transaction ID in FMW), and other criteria, which filters the graphs further down the page showing thread breakdowns by many of these dimensions. I could also immediately create a diagnostic snapshot of the data to look at later if I so desired. I can also click on the Threads tab (next to the highlighted “General” tab above) and look at historical thread data or play with the timeframe, but we can see just by looking at this that we were correct about the threads in the DB Wait state, and it has been going on for some time now. Let’s navigate from historical data to JVM live threads (collected every 2 seconds using native thread sampling as opposed to byte code instrumentation) to try to determine why so many threads are stuck in the DB Wait state.
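If you do not have the JVMD live-thread view in front of you, the closest manual equivalent is to take a few thread dumps a little while apart and look for threads parked in JDBC calls. A hedged WLST (Jython) sketch follows; the credentials, URL, server name, and output paths are placeholders.

    # Hedged WLST (Jython) sketch: capture a few thread dumps from a managed
    # server so the stuck or DB-waiting threads can be inspected by hand.
    # Credentials, URL, server name, and file paths are placeholders.
    import time

    connect('weblogic', 'welcome1', 't3://adminhost:7001')
    for i in range(3):
        threadDump(writeToFile='true',
                   fileName='/tmp/ManagedServer_1_dump_%d.txt' % i,
                   serverName='ManagedServer_1')
        time.sleep(10)
    disconnect()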
Looking at the live threads view, it is apparent that we are running a SQL prepared statement originating from a front-end request to the “/registerPatient.action” URL. I could then click on the “SQL ID” to bring myself to the SQL in question within a tuning screen, but the route of more interest is to click on the DB Wait link highlighted in the lower half of the screen for one of the threads. This takes me into a read-only view of the actual Oracle database session itself.
Here we are in the database session itself. As an operations person or developer, my options are obviously very restricted, but I can see that there is a blocking session ID. Better yet, I can now click on that blocking session ID, see that something entirely outside of my WLS container or JVM is causing contention, and communicate with my DBA to address the problem. This could just as easily have been a badly tuned SQL statement or an index problem. Likewise, I could have discovered that my threads were locked by one another, by a Network Wait, or even by File I/O. There are a multitude of possibilities, but because I have a tool that can see across these tiers, I can quickly diagnose the issue, and I am speaking the same language as my DBA. DBAs can also drill back up from SQL statements to the JVM and WLS container (also in read-only mode, obviously), so they can be proactive about maintaining the application. This is just one simple example of how Enterprise Manager Cloud Control facilitates this type of communication between roles. There are many other similar features, from dashboards that can be tailored per role to give each team member the appropriate visibility, to incident management designed to let teams collaborate, or even work with Oracle Support via the WebLogic Support Workbench if necessary.
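For comparison, the manual version of that last drill-down is to ask the database directly which sessions are blocked and by whom. Here is a hedged standalone Python sketch using the cx_Oracle driver; the connection details are placeholders, and it assumes a database account with access to the v$session view.

    # Hedged Python sketch: list blocked sessions and their blockers from v$session.
    # Connection details are placeholders; requires the cx_Oracle driver and a
    # user with privileges on the v$ views.
    import cx_Oracle

    conn = cx_Oracle.connect('system', 'welcome1', 'dbhost:1521/orcl')
    cur = conn.cursor()
    cur.execute("""
        SELECT sid, serial#, username, sql_id, event, blocking_session
        FROM   v$session
        WHERE  blocking_session IS NOT NULL
    """)
    for sid, serial, username, sql_id, event, blocker in cur:
        print('session %s,%s (%s) waiting on %s, blocked by session %s'
              % (sid, serial, username, event, blocker))
    cur.close()
    conn.close()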
The last myth is that this level of diagnostics necessarily comes with heavy overhead. It is true that most Java transaction tracing solutions create overhead because of byte code instrumentation, and there is certainly a time and place for this type of diagnostics, which can be very detailed and rich in its analysis. Within Oracle Enterprise Manager Cloud Control, we do have an optional advanced diagnostics feature that provides this functionality. Its overhead is routinely much lower than just about any other solution out there, and it is indeed able to run 24/7 without incurring much overhead. For many, the small overhead required is reasonable and well worth the enormous amount of visibility you get from being able to track individual transactions, or groups of transactions, through each tier of your application and isolate problems based on the actual payload.
However, for those who prefer not to use byte code instrumentation, the entire example provided above does not require any. It simply uses the stack metrics collected by the Enterprise Manager Cloud Control agent, which sits on the host (not in the WLS container, and thus out of process), and by the JVMD agent, an extremely lightweight agent (just a war file) that uses native sampling (no byte code instrumentation, and thus no restart of the managed server). The bottom line is that you can get a great deal of visibility without incurring any noticeable overhead, and then decide where and whether you also want to trace transactions on an individual basis. This type of flexibility ensures that all diagnostics needs are met.
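Since the JVMD agent is just a war file targeted at the managed servers, deploying it looks like any other war deployment. The sketch below uses the standard WLST deploy command; the application name, file path, and target are placeholders rather than the actual JVMD artifact names, which come from your Cloud Control installation.

    # Hedged WLST (Jython) sketch: deploy a diagnostics agent war to a managed
    # server like any other web application. Names and paths are placeholders.
    connect('weblogic', 'welcome1', 't3://adminhost:7001')
    deploy(appName='jvmd_agent', path='/tmp/jvmd_agent.war', targets='ManagedServer_1')
    disconnect()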
Alright, so that was my last myth to dispel for this blog.  I could go on for quite some time and show the many other capabilities of the Enterprise Manager product such as the earlier mentioned Middleware Diagnostics Advisor, log viewing and alerting, the multitude of dashboards, thresholds, lifecycle management, disaster recovery, and patch automation features that span the full capabilities of Oracle’s solution for WebLogic and Fusion Middleware management, but perhaps there will be time for another blog on those topics later.
For now, I will leave you with some resources to help you leap beyond the myths.
Additional Resources 
Free Online Self-Study Courses from Oracle Learning Library (OLL)
WLS Performance Monitoring and Diagnostics
WLS Configuration and Lifecycle Management
Coherence Management
Real User Experience Insight

Disclaimer

Opinions expressed in this blog are entirely the opinions of the writers of this blog and do not reflect the position of Oracle Corporation. No responsibility will be taken for any resulting effects if any of the instructions or notes in the blog are followed. It is at the reader's own risk and liability.
