TimWillemstein2 (Customer) asked a question.

WildFly cannot start (existing server and new install, same database)

When we start WildFly, the application tries to start, but deployment is cancelled after the configured 15-minute timeout with this error (from server.log):

"2024-02-14 15:35:20,872 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [900] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[("interface" => "management")]'"

I have tried extending this to 1 hour, but it gives exactly the same error after that time. Any ideas on how to fix this?


  • TimWillemstein2 (Customer)

    I was able to figure out what was going on. For some reason, none of the logs showed an error about some queries that got stuck because of tablespace sizing. I found it by checking Oracle Enterprise Manager for problematic queries. After changing some database parameters, we were able to get the application up and running. Thanks for the thoughts and insights!

    Selected as Best
  • Staines_ian (RSA Security)

    I see you opened a case for this, so I will let the engineer investigate.

    This error just means the server cannot start (some prerequisites are not met), so it rolls back.

    I see there are other errors before the timeout, so I think there is a problem with the database connection.
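
    To see which errors came before the timeout, a small filter can pull them out of the log. This is a minimal sketch, assuming lines shaped like the server.log excerpt in the question (timestamp, then a level such as ERROR); the function name errors_before_timeout is hypothetical.

```bash
# Print the ERROR lines logged before the first WFLYCTL0348
# (service container stability timeout) message in a WildFly log.
# Hypothetical helper; assumes " ERROR " appears as the level field.
errors_before_timeout() {
  local log="$1"
  # Stop at the first timeout message; print every ERROR line seen before it.
  awk '/WFLYCTL0348/ { exit } / ERROR / { print }' "$log"
}
```

    Point it at the managed server's server.log (in WildFly domain mode usually under $AVEKSA_HOME/wildfly/domain/servers/<server-name>/log/, but check your install). As the accepted answer shows, stuck queries may never appear in any log, so an empty result does not rule out a database problem.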

  • OverthinkerDave (Customer)

    Very often these problems occur due to "incorrect environment variables".

    The last time I got this error (WFLYSRV0062 in my case, which also triggers a rollback), it was because the timeout was not long enough (default 15 min); if I remember correctly, the install succeeded after 18 min.

    It is not easy to increase the timeout, but this was my most recent experience:

    • $AVEKSA_HOME/wildfly/domain/configuration/domain.xml
      • -> no effect
      • although the oracle user's $AVEKSA_HOME/wildfly/bin/domain.sh process shows no timeout at all
        • -> so maybe this process doesn't use any timeout, even if the configuration says so
    • $AVEKSA_HOME/wildfly/domain/configuration/host.xml
      • -> takes effect on the application server locally [starting with "Server:img-server-xxxxx"]
    • $AVEKSA_HOME/wildfly/bin/domain.conf, ProcessController section
      • -> takes effect on [Process Controller]
    • $AVEKSA_HOME/wildfly/bin/domain.conf, HostController section
      • -> takes effect on [Host Controller]
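
    As an illustration of the HostController entry, the fragment below assumes the standard WildFly domain.conf variable HOST_CONTROLLER_JAVA_OPTS and the WildFly system property jboss.as.management.blocking.timeout (in seconds), which controls the WFLYCTL0348 wait for service container stability. The value 1800 is only an example, and per the list above this affects the Host Controller JVM, not the application server configured in host.xml.

```bash
# Hypothetical addition to $AVEKSA_HOME/wildfly/bin/domain.conf, placed after
# the existing HostController section so it appends to whatever is set there.
# jboss.as.management.blocking.timeout is in seconds; 1800 (30 min) is an example.
HOST_CONTROLLER_JAVA_OPTS="$HOST_CONTROLLER_JAVA_OPTS -Djboss.as.management.blocking.timeout=1800"
```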

    Another tip: always watch for file changes under $AVEKSA_HOME/wildfly/domain/log.

    You can very often see log files change there. Most base installations include a patch, and patch.log is important to follow. While the patch is running, the machine usually sits at 0% CPU and only the database is working hard, but these logs at least show that something is happening.
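
    A quick way to see which logs are still changing is to list recently modified files. This sketch assumes a find that supports -mmin (GNU and BSD find both do); the function name recent_log_changes and its 5-minute default window are hypothetical choices.

```bash
# List files under the WildFly domain log directory modified within the last
# N minutes, as an "is anything still happening?" check during a long start.
# Hypothetical helper: arg 1 = directory, arg 2 = window in minutes.
recent_log_changes() {
  local dir="${1:-$AVEKSA_HOME/wildfly/domain/log}"
  local minutes="${2:-5}"
  find "$dir" -type f -mmin "-$minutes"
}
```

    Run it with no arguments to check the default directory, or for example as recent_log_changes "$AVEKSA_HOME/wildfly/domain/log" 15 for a wider window.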

    And of course, always check "the usual" operating system parameters (a very common cause of timeouts):

    On Linux:

    • Aveksa_System.cfg
      • which exists in two locations after an install:
        • /root
        • your install directory (usually /home/oracle)
      • which has a lot of references to database parameters
        • these parameters could be the problem; do not be afraid to use the REMOTE_ORACLE_JDBC_URL parameter (it overrides all other connection parameters and, in my opinion, is easier to use)
        • ports should not be an issue (because you used the same server), but still try: curl -v telnet://<database ip/dns>:<port to db>
      • which has a reference to Java
        • which needs to be the RSA-provided Java
        • which must be set to the same Java in both files
    • setDeployEnv.sh
    • .bash_profile
      • check both the root and oracle users
        • where only root's should be changed if doing a new install as a non-root user
      • which calls setDeployEnv.sh
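
    The "same Java in both files" check can be scripted. This is a sketch under an explicit assumption: that Aveksa_System.cfg stores the Java location as a KEY=value line (optionally prefixed with export). The key name JAVA_HOME used as the default here is a placeholder, so check which key your file actually uses; cfg_value and same_java are hypothetical helper names.

```bash
# Print the value of the last KEY=value (or "export KEY=value") line for a key,
# with surrounding quotes removed. ASSUMPTION: the cfg file uses this syntax.
cfg_value() {
  local file="$1" key="$2"
  sed -n "s/^[[:space:]]*\(export[[:space:]]\{1,\}\)\{0,1\}${key}=//p" "$file" \
    | tail -n 1 | tr -d "\"'"
}

# Compare one key (default: placeholder JAVA_HOME) across two cfg files.
same_java() {
  local key="${3:-JAVA_HOME}" a b
  a=$(cfg_value "$1" "$key")
  b=$(cfg_value "$2" "$key")
  if [ -n "$a" ] && [ "$a" = "$b" ]; then
    echo "OK: both files use $a"
  else
    echo "MISMATCH: '$a' vs '$b'" >&2
    return 1
  fi
}
```

    Usage, with the two locations from the list above: same_java /root/Aveksa_System.cfg /home/oracle/Aveksa_System.cfg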

    In short: run "echo $<variable-name>" as the user you are doing the install with. Every parameter must show a value and must match Aveksa_System.cfg (a mismatch there is the most common cause of the problem).
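
    Rather than echoing each variable by hand, a small bash function can report which ones are empty for the current user. This is a sketch; check_vars is a hypothetical name, and the variable names you pass (for example AVEKSA_HOME) should be the ones your Aveksa_System.cfg and setDeployEnv.sh actually define.

```bash
# Report unset or empty environment variables for the current user.
# Returns non-zero if any are missing. Uses bash indirect expansion ${!name}.
check_vars() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name}" ]; then
      echo "MISSING: $name"
      missing=$((missing + 1))
    else
      echo "$name=${!name}"
    fi
  done
  [ "$missing" -eq 0 ]
}
```

    Run it as the install user, for example after su - oracle: check_vars AVEKSA_HOME JAVA_HOME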

    In my opinion, all scripts should stop using environment variables and instead load whatever is set in Aveksa_System.cfg. All start scripts already call this file, so why is setDeployEnv.sh even needed? With that approach, you would also avoid conflicts with root user variables of the same name.
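
    The idea of loading settings only from Aveksa_System.cfg can be sketched as a wrapper that starts a command in an empty environment and sources the cfg file first. This assumes the cfg file is plain shell KEY=value syntax that can be sourced; run_with_cfg is a hypothetical name, not an RSA script, and a real start script may need a few more variables (such as HOME) passed through.

```bash
# Run a command with variables taken only from a cfg file, not from the
# caller's login environment. env -i clears everything; PATH is set explicitly.
# set -a exports every variable the cfg file assigns.
run_with_cfg() {
  local cfg="$1"
  shift
  env -i PATH="/usr/bin:/bin" "$BASH" -c 'set -a; . "$1"; set +a; shift; exec "$@"' _ "$cfg" "$@"
}
```

    Pass an absolute path to the cfg file, for example: run_with_cfg /home/oracle/Aveksa_System.cfg printenv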

    Sorry for the long post, but I also spent hours on these kinds of problems.

    If I get time, I will soon post a guide on some current errors in the scripts of the v8.0.0 base release and of P01 for the same version.
  • Staines_ian (RSA Security)

    In older legacy versions, we simply did not wait long enough, and increasing the timeout was standard procedure. In current versions of the product, that has been resolved with much longer default timeouts.

    Normal startup should take about a minute, but we now wait 15 minutes. If the server takes more than 15 minutes to start, it is never going to start, and simply waiting longer will not resolve the issue; that requires a different approach.
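
    To compare an actual startup with the roughly one-minute figure, the boot time can be read from the log. This sketch assumes the standard WildFly WFLYSRV0025 boot message ("... started in <N>ms ..."); startup_ms is a hypothetical helper name. If the server never reaches that message, nothing is printed, which itself suggests the boot did not complete.

```bash
# Print the boot time in milliseconds from the last WFLYSRV0025
# "started in <N>ms" message in a WildFly log, or nothing if none is found.
startup_ms() {
  grep 'WFLYSRV0025' "$1" | tail -n 1 \
    | sed -n 's/.* started in \([0-9][0-9]*\)ms.*/\1/p'
}
```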