Error: "Failed: At least 300s since last heartbeat"

LEGACY ARTICLE: The content in this article is no longer updated and is available for reference purposes only. Features and workflows described may be deprecated, significantly changed, or no longer supported.

Environment

  • Datto EDR

Description

After a survey has been launched by the Controller, the Agent, or the HUNT Server, a common error is "Failed: At least 300s since last heartbeat". This occurs when the Survey falls out of sync with the Cloud instance: if the Datto EDR server does not hear back from the Survey for over 5 minutes (300 seconds), a timeout occurs. This can happen for a variety of reasons, but they generally fall into one of two buckets:

  • The Survey is no longer running.

  • The Survey is not able to "talk" back to the SaaS instance.
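
As a rough illustration of the timeout rule described above, the check can be sketched as follows. This is not the server's actual code: the function name and the choice to treat exactly 300 seconds as a timeout (following the "At least 300s" wording of the error) are assumptions.

```python
from datetime import datetime, timedelta

# 300 seconds (5 minutes), taken from the error message.
HEARTBEAT_TIMEOUT = timedelta(seconds=300)


def heartbeat_timed_out(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return True when at least `timeout` has passed since the last heartbeat.

    Hypothetical helper: the real server-side check is not documented here.
    """
    return now - last_heartbeat >= timeout


last_seen = datetime(2020, 3, 10, 9, 0, 42)
print(heartbeat_timed_out(last_seen, datetime(2020, 3, 10, 9, 4, 42)))  # 240s elapsed
print(heartbeat_timed_out(last_seen, datetime(2020, 3, 10, 9, 6, 0)))   # 318s elapsed
```

The first call prints False (240 seconds is under the limit) and the second prints True. Either bucket above produces the same result on the server side: no heartbeat arrives, so the elapsed time keeps growing until the limit is reached.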

This article is meant to tackle the most common scenarios which cause the issue. It is by no means comprehensive, but should give you the "next steps" to get your problem solved.

Survey Crash or Communication Problem?

Narrowing down the root cause:

  • At which step did the 300s timeout occur? Most of these occur right after the "Status changed to active" step.
    The Survey was started successfully, but then the server never heard back for the first "heartbeat".

  • Is the target host in question able to browse to your Datto EDR SaaS Instance? (see below)

  • Is there an antivirus or endpoint protection application on the target host? If so, the Survey hash and/or the Datto EDR files need to be allowlisted.

In the common case where the issue happens right after "Status changed to active", the most likely culprits are:

  • The Survey can't talk to the server because of a networking/internet problem.

  • Antivirus killed the Survey.

How do you tell which one?

Testing/Troubleshooting Network Problems

One test you can try is to browse to your Cloud instance from the target host on HTTPS:

https://<INSTANCENAME>.infocyte.com

If you cannot at least reach the login prompt, this is most likely the issue. Troubleshoot the browsing issue first:

  • Is the machine able to browse to other sites?
  • Is a web proxy or SSL decryption getting in the way?
  • Is dl.infocyte.com allowlisted in your environment?
  • Are the IP Addresses for your instance allowed through the firewall on port 443?
    Datto EDR IP Addresses to allow:
    3.221.153.58
    3.227.41.20
    3.229.46.33
    35.171.204.49
    52.200.73.72
    52.87.145.239
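
If no browser is available on the target host, the firewall question above can be approximated with a short script. The following is a minimal sketch, assuming Python 3 is installed on the host; the function name check_tcp is hypothetical and not part of any Datto EDR tooling. It only confirms that a TCP connection to port 443 can be opened. It does not validate TLS, proxies, or SSL decryption.

```python
import socket

# IP addresses listed above; 443 is the HTTPS port.
DATTO_EDR_IPS = [
    "3.221.153.58",
    "3.227.41.20",
    "3.229.46.33",
    "35.171.204.49",
    "52.200.73.72",
    "52.87.145.239",
]


def check_tcp(ips, port=443, timeout=5.0, connect=socket.create_connection):
    """Try to open a TCP connection to each address.

    Returns a dict mapping each address to None on success, or to the
    error text on failure. `connect` is injectable so the function can
    be exercised without network access.
    """
    results = {}
    for ip in ips:
        try:
            conn = connect((ip, port), timeout=timeout)
            conn.close()
            results[ip] = None
        except OSError as exc:  # covers timeouts and refused connections
            results[ip] = str(exc) or type(exc).__name__
    return results
```

Calling check_tcp(DATTO_EDR_IPS) on the target host and printing the result shows which addresses, if any, are blocked. A failure on every address points to a firewall or routing problem rather than to the Survey itself.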

Even if you can browse to the Cloud instance successfully, you may still have a more nuanced networking issue. If you suspect so, a couple of tools can provide more information about the problem:

  1. The Survey log. This log should be either in the Agent install directory under "\logs\" or in "C:\Windows\Temp\logs\" ("/tmp/logs" on Linux).

    Once the "Failed: At least 300s..." message occurs, collect the log from the target host. You will find it at C:\Windows\Temp\agent.log on Windows and /tmp/agent.log on Linux. Look for errors related to connectivity. For example:
    [2020-03-10 09:00:42][hunt_survey::mothership::heartbeat_response] - Error communicating with the server: https://myinfocyte.infocyte.com/api/survey/reply: error trying to connect: peer misbehaved: downgrade to TLS1.2 when TLS1.3 is supported
    [2020-03-10 09:04:42][hunt_survey] - Server appears to be down, stopped responding to heartbeats
  2. Wireshark. This is a more advanced network diagnostic tool; if you are not already familiar with it, you may need assistance from our support staff to collect and interpret a capture. If you are familiar with it, use the following Capture Filter to collect only packets exchanged with the Cloud instance:
    host 3.221.153.58 or host 3.227.41.20 or host 3.229.46.33 or host 35.171.204.49 or host 52.87.145.239 or host 52.200.73.72
    If you are using the on-premise (HUNT Server) product, use a capture filter for your HUNT Server IP instead.
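
For long Survey logs, the search for connectivity errors can be scripted. The sketch below assumes every log line follows the shape of the two sample lines above ([timestamp][module] - message). The list of phrases is taken from those samples only and is not exhaustive, and the function name is hypothetical.

```python
import re

# Line shape inferred from the sample log lines: [time][module] - message
LOG_LINE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\[(?P<module>[^\]]+)\] - (?P<message>.*)$"
)

# Phrases seen in the sample lines (assumption: extend this list as needed).
CONNECTIVITY_HINTS = (
    "error communicating with the server",
    "error trying to connect",
    "stopped responding to heartbeats",
)


def find_connectivity_errors(lines):
    """Return (time, module, message) for lines that mention a connectivity problem."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # not in the expected format; skip it
        message = match.group("message")
        if any(hint in message.lower() for hint in CONNECTIVITY_HINTS):
            hits.append((match.group("time"), match.group("module"), message))
    return hits


SAMPLE_LINES = [
    "[2020-03-10 09:00:42][hunt_survey::mothership::heartbeat_response] - Error communicating with the server: https://myinfocyte.infocyte.com/api/survey/reply: error trying to connect: peer misbehaved: downgrade to TLS1.2 when TLS1.3 is supported",
    "[2020-03-10 09:04:42][hunt_survey] - Server appears to be down, stopped responding to heartbeats",
]

for time, module, message in find_connectivity_errors(SAMPLE_LINES):
    print(time, module, message[:60])
```

To scan a real log, pass an open file instead of SAMPLE_LINES, for example find_connectivity_errors(open("/tmp/agent.log")) on Linux.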

Antivirus/Endpoint Protection

If the Survey was never heard from after going "Active" and you have ruled out network connectivity problems, then the most likely cause is that a third-party program stopped the Survey from running.

We recommend allowlisting the Survey hash in your endpoint security products. To find the Survey hashes, log in to your Cloud instance, click your account icon at the top right, click "Admin", and select "Downloads".

If you are not sure whether this is the issue, test it: download the appropriate Survey Agent for the target host, then on the host itself open an Administrative prompt (Windows) or a sudo terminal (Linux) and type:

Windows:

agent.windows64.exe survey 

Linux 64 bit:

agent.linux64.exe survey 

Linux 32 bit:

agent.linux32.exe --no-delete

Check if the Survey finishes. If it does, then go back to network troubleshooting. If it crashes, check for a security pop-up, or check your endpoint security logs for a block. Then allowlist the hash (recommended) or filename and try again.
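
If you need to repeat this test on several hosts, a small wrapper can record whether the command started, whether it finished, and how it exited. This is a minimal sketch, assuming Python 3 is available on the host; run_and_report is a hypothetical helper, and this article does not document what the agent's exit codes mean, so treat a non-zero code only as a prompt to check your endpoint security logs.

```python
import subprocess
import sys


def run_and_report(command, timeout=None):
    """Run `command` (a list of arguments) and summarize what happened.

    started  : False if the OS refused to launch the file (missing,
               quarantined, or blocked).
    finished : False if the process was still running at `timeout`.
    """
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except OSError as exc:
        return {"started": False, "finished": False,
                "returncode": None, "detail": str(exc)}
    except subprocess.TimeoutExpired:
        return {"started": True, "finished": False,
                "returncode": None, "detail": "timed out"}
    tail = (completed.stdout + completed.stderr)[-2000:]
    return {"started": True, "finished": True,
            "returncode": completed.returncode, "detail": tail}


# Only runs when arguments are given, e.g.:
#   python run_survey.py agent.windows64.exe survey
if __name__ == "__main__" and len(sys.argv) > 1:
    print(run_and_report(sys.argv[1:]))
```

A result with started set to False usually means the file was removed or blocked before it could run, which is consistent with an endpoint security product intervening.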

Debugging Other Issues

Occasionally, there are issues that do not fit into the above buckets. You may see the "Failed: At least 300s since last heartbeat" error later in the process, for example on the "Autostarts" step.

In these situations, collect a debug log and send it to our support team. If the failure is specific to one object (like Autostarts), it helps to run the offline survey manually in verbose mode, collecting only autostarts.

Note: Run the following command from an Administrative command prompt.

agent.windows64.exe --verbose survey --only-autostarts

For more information, refer to Scanning an offline endpoint.

Once complete, or once the crash occurs, collect the agent-<DATE>.log and review (or send to Datto EDR Support).
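
If several dated logs have accumulated, the most recent one can be picked out with a short script. This is a sketch only: the directory to search is an assumption (use the folder the agent was run from, or one of the log locations mentioned earlier), and newest_agent_log is a hypothetical helper name.

```python
from pathlib import Path


def newest_agent_log(directory):
    """Return the most recently modified agent-*.log in `directory`, or None."""
    candidates = list(Path(directory).glob("agent-*.log"))
    # Compare by modification time; default=None covers an empty directory.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

For example, newest_agent_log(".") returns the latest agent-<DATE>.log in the current directory, which is the file to review or attach to a support ticket.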

Getting Support

Again, this guide is not comprehensive. If you have tried the above and still cannot reach a resolution, please contact Datto Technical Support and we'll help you diagnose the problem.