Reducing Critical Incidents by 40% with ELK & AppDynamics Observability

The Challenge

Our client, ICE Mortgage Technology (previously Ellie Mae), a major mortgage technology provider in the United States, had recently acquired a company whose remote data center had very little observability: logging, monitoring, and application performance tracking were all limited.

Applications in this data center generated over 2 TB of logs every day, but Site Reliability Engineers (SREs) had to manually log in to servers and sift through raw log files to troubleshoot issues. As a result:

Sev0 and Sev1 incidents were exceptionally high.
Incident resolution times were significantly longer than in other business units.

Although the client had an Elasticsearch (ES) cluster in the cloud, this remote data center had no direct connectivity to it. Any log data had to be routed through another data center that was connected.

Our Solution

We designed and implemented a scalable observability solution using the ELK stack (Elasticsearch, Logstash, Kibana) and AppDynamics APM. Our approach enabled end-to-end logging, monitoring, and proactive alerting across the remote data center.

Key Components:

Two-Tier Logstash Architecture:
Log Aggregator: Deployed within the remote data center to collect logs locally.
Log Parser & Enrichment Layer: Hosted in the connected data center to parse and forward data to Elasticsearch.
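
To make the split of responsibilities concrete, here is a minimal Python sketch of the two tiers. It is an illustration only: the actual deployment used Logstash, and the log line layout, the function names (aggregate, parse_and_enrich, run_pipeline), and the datacenter field are all assumptions made for this example.

```python
import re
from typing import Iterable, Iterator

# Hypothetical log layout: "<timestamp> <LEVEL> <service> <message>".
# The real formats handled in the project are not described in the case study.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def aggregate(lines: Iterable[str], batch_size: int = 2) -> Iterator[list[str]]:
    """Tier 1 (remote data center): batch raw lines without parsing them."""
    batch: list[str] = []
    for line in lines:
        batch.append(line.rstrip("\n"))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


def parse_and_enrich(batch: list[str], datacenter: str) -> list[dict]:
    """Tier 2 (connected data center): parse each line and add metadata."""
    events = []
    for raw in batch:
        match = LINE_PATTERN.match(raw)
        if match:
            event = match.groupdict()
        else:
            # Keep unparseable lines and tag them instead of dropping them.
            event = {"message": raw, "tags": ["parse_failure"]}
        event["datacenter"] = datacenter  # assumed enrichment field
        events.append(event)
    return events


def run_pipeline(lines: Iterable[str], datacenter: str = "remote-dc") -> list[dict]:
    """Chain both tiers; the result stands in for documents sent to Elasticsearch."""
    documents: list[dict] = []
    for batch in aggregate(lines):
        documents.extend(parse_and_enrich(batch, datacenter))
    return documents


if __name__ == "__main__":
    sample = [
        "2024-01-01T00:00:00Z ERROR loan-api Timeout calling pricing",
        "garbage",
    ]
    for doc in run_pipeline(sample):
        print(doc)
```

Keeping the first tier free of parsing means the remote site only needs enough capacity to collect and forward, while the heavier parsing work runs where connectivity to Elasticsearch already exists.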

Efficient Log Transfer:

We used the Lumberjack protocol for secure and low-bandwidth log shipping over a VPN tunnel between data centers. 
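
To see why bandwidth mattered on this link, a rough back-of-the-envelope calculation helps. The sketch below converts the 2 TB per day figure into a sustained transfer rate. It assumes decimal terabytes and perfectly even traffic over 24 hours; the compression_ratio parameter is hypothetical, since the case study does not state how much the shipped logs were compressed.

```python
BYTES_PER_TB = 10**12          # decimal terabyte (assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(tb_per_day: float, compression_ratio: float = 1.0) -> tuple[float, float]:
    """Return (MB/s, Mbit/s) needed to ship tb_per_day evenly over one day.

    compression_ratio is a hypothetical knob: 2.0 means the data on the wire
    is half the size of the raw logs.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    bytes_per_second = tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY / compression_ratio
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


if __name__ == "__main__":
    mb_s, mbit_s = sustained_rate(2.0)
    print(f"{mb_s:.1f} MB/s, {mbit_s:.1f} Mbit/s")
```

With no compression, 2 TB per day works out to about 23 MB/s, or roughly 185 Mbit/s sustained, before allowing for any peaks in log volume.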

Automated Deployments:

Infrastructure as Code was implemented using Jenkins, Ansible, and Git to automate deployment and manage configurations for Beats agents and Logstash servers.

Application Performance Monitoring:

AppDynamics agents were deployed on application servers, providing real-time APM dashboards and insights.

Integrated Alerts & Reporting:

Alerts were configured for common failure points, routed via Slack, email, xMatters, and Jira webhooks.

Daily KPI reports were automatically sent to product owners.
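
The alert routing described above can be pictured as a simple lookup from severity to notification channels. The following Python sketch is illustrative only: the channel names come from this case study, but the severity-to-channel mapping, the Alert class, and the function names are assumptions, and no real Slack, xMatters, or Jira API is called.

```python
from dataclasses import dataclass

# Hypothetical mapping: which severity notifies which channels is assumed
# for illustration and is not taken from the actual project configuration.
ROUTES = {
    "sev0": ["xmatters", "slack", "jira", "email"],
    "sev1": ["xmatters", "slack", "jira"],
    "sev2": ["slack", "jira"],
}
DEFAULT_ROUTE = ["email"]  # fallback for anything not listed above


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str
    summary: str


def route(alert: Alert) -> list[str]:
    """Return the channels that should be notified for this alert."""
    return list(ROUTES.get(alert.severity.lower(), DEFAULT_ROUTE))


def build_notifications(alert: Alert) -> list[dict]:
    """Build one payload per channel; sending them is left to the caller."""
    text = f"[{alert.severity.upper()}] {alert.name}: {alert.summary}"
    return [{"channel": channel, "text": text} for channel in route(alert)]


if __name__ == "__main__":
    alert = Alert("disk-usage", "Sev1", "Log volume above 90% on app server")
    for payload in build_notifications(alert):
        print(payload)
```

Keeping the routing table as data rather than code makes it easy to review and to manage alongside the other configuration kept in Git.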

Architecture Overview

App Servers (Beats agents) → Log Aggregator (Logstash Layer 1) → Log Parser & Enrichment (Logstash Layer 2 via Lumberjack) → Elasticsearch Cluster

The Results

Within just one month of production deployment, we delivered measurable improvements:

40% reduction in Sev0 and Sev1 incidents.
30-minute decrease in average incident resolution time.
Proactive alerts that helped SRE teams catch issues before they escalated.
Daily reports on key performance indicators (KPIs) for product owners.

Why This Matters

Downtime and slow incident response are not just technical issues: they directly affect customer trust, revenue, and brand reputation. For our client, every Sev0 or Sev1 incident meant potential disruption for thousands of users relying on its mortgage technology platform.

By implementing centralized observability, we not only reduced incidents but also empowered their teams with real-time insights, faster root cause analysis, and proactive alerting.

This solution didn’t just improve IT operations. It also strengthened overall business resilience, enabling faster decisions, lower operational costs, and a better experience for end customers.
