The Hidden Complexity of Segmented Elastic Deployments: When Architecture Creates Problems
- Paul Veenstra

- Nov 20
- 7 min read
You're designing your Elastic Security deployment architecture. You've got multiple domains, security zones, or network segments. The customer's architect follows a logical thought process that goes something like this: "We don't want agents from Domain A directly connecting to our Elasticsearch cluster. Let's put Fleet servers in each domain, combine them with Logstash for event forwarding, and create a nice segmented architecture."
On paper, this looks clean. Agents stay in their domains. Traffic flows through controlled chokepoints. Security teams love segmentation, right?
But here's what the architecture diagrams don't show you: the operational complexity, performance challenges, and troubleshooting nightmares you just signed up for.
Let's walk through what actually happens when you implement this setup, and why that "logical" architectural decision might not be so logical after all.
The Architecture That Seems Reasonable
Here's the setup we're talking about:
Domain A:
Elastic agents deployed on endpoints
Fleet Server (Linux VM for redundancy)
Logstash Server (Linux VM for redundancy)
Agents connect to Fleet Server for management
Logstash forwards events to Elasticsearch cluster
Domain B, C, D...:
Same pattern repeated in each domain or network segment
The reasoning sounds solid:
Network segmentation maintained
Traffic between domains controlled
Fleet management distributed
Agents don't need direct connectivity to Elasticsearch
What could go wrong?
Problem 1: You Need to Load Balance Your Fleet Servers
Here's the first issue you'll hit: Fleet Server doesn't natively support load balancing the way you'd expect. You deployed two Fleet Servers in Domain A for redundancy. Great. But how do agents know which one to connect to?
You might think, "I'll just put a load balancer in front of them." Sounds simple, except it adds complexity of its own, because you have to implement it in each and every domain. Without a proper load balancing solution, you end up with:
Uneven load distribution
Complexity in configuration management at scale
Difficulty predicting which Fleet Server is actually serving which agents
If you're going to do this right, you need to invest in proper load balancing infrastructure for each domain. That's more cost, more complexity, more things to manage and troubleshoot.
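If you do go down this road, at a minimum you want health checks in front of those Fleet Servers in every domain. Here's a minimal sketch of what such a probe could look like in Python; the hostnames are placeholders, and the /api/status endpoint and "HEALTHY" value are assumptions you should verify against your Fleet Server version before relying on this.

```python
# Minimal health probe for the Fleet Servers in one domain.
# Hypothetical hostnames; the /api/status endpoint and the "HEALTHY" state
# are assumptions about Fleet Server's status API.
import requests

FLEET_SERVERS = [
    "https://fleet-a1.domain-a.internal:8220",
    "https://fleet-a2.domain-a.internal:8220",
]

def healthy(base_url: str, ca_cert: str = "/etc/elastic/ca.crt") -> bool:
    """Return True when the Fleet Server reports a healthy status."""
    try:
        resp = requests.get(f"{base_url}/api/status", timeout=5, verify=ca_cert)
        return resp.ok and resp.json().get("status", "").upper() == "HEALTHY"
    except requests.RequestException:
        return False

if __name__ == "__main__":
    for url in FLEET_SERVERS:
        print(f"{url}: {'up' if healthy(url) else 'DOWN'}")
```

A load balancer can run an equivalent check natively, but either way someone has to build and maintain it, per domain.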
Problem 2: Control Plane vs. Data Plane - The Resource Battle
Here's where things get really interesting. You've got Fleet Server and Logstash running on the same systems (or at least in the same environment). As your deployment grows, you're probably thinking "let's add some additional integrations to these systems to collect more security telemetry."
Stop right there.
You've just mixed your control plane (Fleet Server managing agents) with your data plane (processing and forwarding security events). These two functions will compete for the same resources: CPU, memory, network bandwidth, and disk I/O.
What happens when your data plane gets overwhelmed?
Logstash starts queuing events because it can't process fast enough
Memory pressure builds up
Network connections get saturated (connection resets)
Disk I/O spikes from buffering
All of which adds up to latency
What happens to your control plane during this resource contention?
Fleet Server responses slow down
Agent check-ins take longer
Policy updates get delayed
EDR response actions (isolate endpoint, kill process) experience latency
That last one should terrify you. When you need to isolate a compromised endpoint RIGHT NOW, and your Fleet Server is bogged down because other integrations on the same agent are processing a huge influx of events, you've got a serious problem.
In an incident response scenario, seconds matter. Delays in EDR response capabilities because of architectural decisions can mean the difference between containing a breach and watching it spread.
The principle is simple: don't mix heavy integrations with Fleet Server deployments. Keep your control plane responsive by not burdening it with data plane workloads.
If you need multiple integrations in a domain, deploy separate systems for them. Yes, that's more infrastructure. But the alternative is risking your EDR response capabilities when you need them most.
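One practical mitigation is to watch control plane responsiveness explicitly, so you notice contention before an incident does. Below is a rough sketch, using a hypothetical hostname and an arbitrary threshold, and the same status endpoint assumption as the earlier probe.

```python
# Rough control-plane responsiveness check: time the Fleet Server status
# endpoint and flag slow responses before EDR actions start to lag.
# Hostname, threshold, and the /api/status endpoint are assumptions.
import time
import requests

FLEET_URL = "https://fleet-a1.domain-a.internal:8220/api/status"
THRESHOLD_S = 2.0  # tune to your own measured baseline

def check_latency(url: str, ca_cert: str = "/etc/elastic/ca.crt") -> float:
    """Return how long the Fleet Server took to answer its status endpoint."""
    start = time.monotonic()
    requests.get(url, timeout=10, verify=ca_cert).raise_for_status()
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = check_latency(FLEET_URL)
    status = "OK" if elapsed < THRESHOLD_S else "SLOW - investigate resource contention"
    print(f"Fleet Server responded in {elapsed:.2f}s: {status}")
```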
Problem 3: The Artifact Repository Nightmare
Here's a problem that often gets discovered late in implementation: your agents deployed in segmented domains can't reach Elastic's artifact repository servers on the internet.
Elastic agents need to fetch binaries, updates, and artifacts. In a standard deployment, they reach out to Elastic's official artifact repositories. Simple and straightforward.
But in your segmented architecture, those agents can't get there. Network policies block it. Firewall rules don't allow it. Security requirements forbid it.
So you need to build a custom solution:
Step 1: Set up a proxy on the same servers as Fleet and Logstash
More services competing for resources on already-burdened systems. See the pattern?
Step 2: Deploy a central artifact repository server
This server needs to sync with Elastic's official repository to stay current.
Step 3: Configure the proxy to point agents to your central repo
Now you're maintaining proxy configurations and ensuring connectivity.
Step 4: Modify Elastic Defend policies
In the advanced settings of your Defend policies, you need to specify the custom artifact repository for agents in each domain. Which means a LOT more policies to manage per domain.
Step 5: Keep your central repo synchronized
You now own the responsibility of ensuring your artifact repository stays current with Elastic's. Miss an update cycle, and your agents can't get critical security updates.
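To make the sync step concrete, here's a naive one-way mirror sketch. The agent version, URL pattern, and local path are purely illustrative assumptions; Elastic's documented air-gapped installation guidance should drive the real implementation.

```python
# Naive one-way sync sketch: pull a fixed list of agent artifacts into the
# directory your internal artifact server exposes to the domains.
# Version, URL pattern, and target path are assumptions for illustration only.
from pathlib import Path
import requests

UPSTREAM = "https://artifacts.elastic.co/downloads/beats/elastic-agent"
LOCAL_ROOT = Path("/var/www/elastic-artifacts")            # served internally over HTTPS
ARTIFACTS = ["elastic-agent-8.15.0-linux-x86_64.tar.gz"]   # hypothetical version

def sync() -> None:
    LOCAL_ROOT.mkdir(parents=True, exist_ok=True)
    for name in ARTIFACTS:
        target = LOCAL_ROOT / name
        if target.exists():
            continue  # already mirrored
        with requests.get(f"{UPSTREAM}/{name}", stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(target, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
        print(f"mirrored {name}")

if __name__ == "__main__":
    sync()
```

And remember: this script (or whatever replaces it) is now security-critical infrastructure you own, per the point above.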
What started as "let's segment our network properly" has turned into "let's build and maintain custom artifact distribution infrastructure."
The Operational Burden Multiplies
Let's talk about what day-to-day operations look like with this architecture:
Troubleshooting Event Flow Issues: Events flow from agent → Fleet Server → Logstash → Elasticsearch. When events are delayed or missing, which layer is the problem? You're debugging through multiple hops, each with its own potential bottlenecks.
Maintaining Multiple Fleet Servers: Policy changes need to propagate to Fleet Servers in each domain. Configuration drift becomes a real risk. Version upgrades need to be coordinated across all domains.
Maintaining Multiple Agent Policies: Because the custom artifact repo has to be specified in the Defend policy's advanced settings, you end up with many policies per domain to manage. Changing a single Defend policy setting means changing it in many policies across all domains. Configuration drift becomes a real risk here too.
Monitoring Performance: You need monitoring for Fleet Server health, Logstash performance, proxy functionality, and artifact repo sync status, per domain. That's a lot of dashboards and alerts.
Agent Enrollment: Agents need to be enrolled with the correct Fleet Server URL. That means different enrollment tokens or configurations per domain. More complexity in your deployment automation.
Certificate Management: If you're doing this right, you've got TLS everywhere. That means certificates for Fleet Servers, Logstash, and proxies, all per domain. Certificate renewals become a significant operational task.
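Certificate renewals in particular are easy to let slip. A small expiry sweep like the sketch below (placeholder hostnames and ports, arbitrary warning window) is exactly the kind of per-domain tooling you end up writing and maintaining yourself.

```python
# Per-domain certificate expiry sweep for Fleet Servers, Logstash, and proxies.
# Hostnames and ports are placeholders; the warning window is arbitrary.
import socket
import ssl
from datetime import datetime, timezone

ENDPOINTS = {
    "fleet-a1.domain-a.internal": 8220,
    "logstash-a1.domain-a.internal": 5044,
    "proxy-a1.domain-a.internal": 8443,
}
WARN_DAYS = 30

def days_until_expiry(host: str, port: int) -> int:
    """Connect over TLS and return how many days remain on the peer certificate."""
    ctx = ssl.create_default_context()
    # If these endpoints use a private PKI, load your internal CA first:
    # ctx.load_verify_locations("/etc/elastic/ca.crt")
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host, port in ENDPOINTS.items():
        try:
            days = days_until_expiry(host, port)
            flag = "RENEW SOON" if days < WARN_DAYS else "ok"
            print(f"{host}:{port} expires in {days} days ({flag})")
        except (OSError, ssl.SSLError) as exc:
            print(f"{host}:{port} check failed: {exc}")
```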
When Latency Becomes a Security Issue
All this complexity introduces latency at multiple points:
Agent to Fleet Server communication latency
Fleet Server processing and queue latency
Logstash processing and forwarding latency
Network transit time between domains and the central Elasticsearch cluster
For most log data, a few seconds of latency is acceptable. For EDR response actions, it's not. When you need to isolate a compromised endpoint, you need it to happen in near real-time. Latency in the control plane, caused by resource contention or network complexity, can be the difference between containing an incident and losing control of the situation.
Another serious problem is that your detection logic might not catch delayed events, with the risk of missing important alerts. Machine learning jobs can suffer from the same problem, missing important information and, as a result, potential anomalies.
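If you run this architecture anyway, at least measure the ingest lag so you know how far behind your detections are running. Here's a quick sampling sketch; it assumes the event.ingested field is populated by your ingest pipelines, and the index pattern and credentials are placeholders.

```python
# Quick ingest-lag sample: compare @timestamp (event time) with event.ingested
# (stamped at ingest) for recent endpoint events.
# Index pattern, credentials, and the presence of event.ingested are assumptions.
from datetime import datetime
import requests

ES_URL = "https://elasticsearch.internal:9200"
INDEX = "logs-endpoint.events.*"
AUTH = ("monitor_user", "changeme")  # placeholder credentials

query = {
    "size": 100,
    "sort": [{"@timestamp": "desc"}],
    "_source": False,
    "fields": ["@timestamp", "event.ingested"],
    "query": {"exists": {"field": "event.ingested"}},
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query,
                     auth=AUTH, verify="/etc/elastic/ca.crt", timeout=30)
resp.raise_for_status()

def parse(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp as returned by the fields API."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

lags = []
for hit in resp.json()["hits"]["hits"]:
    f = hit["fields"]
    lags.append((parse(f["event.ingested"][0]) - parse(f["@timestamp"][0])).total_seconds())

if lags:
    print(f"max ingest lag over last {len(lags)} events: {max(lags):.1f}s")
```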
The Alternative: Direct Agent-to-Cluster Connectivity
Here's the architecture that seems less "secure" but is often far more reliable:
Agents connect directly to the Elasticsearch cluster. Fleet Server runs separately from data processing workloads. Agents can reach Elastic's artifact repositories (or a properly architected centralized repo that doesn't share resources with critical infrastructure).
But wait, doesn't that violate network segmentation principles?
Not necessarily. Modern security architecture is moving away from hard network perimeters toward:
Strong authentication (agents authenticate to Elasticsearch)
Encryption in transit (TLS everywhere)
Least privilege access (agents can only access what they need)
Micro-segmentation at the application layer
With proper network security controls, agents connecting directly to Elasticsearch can be more secure and certainly more reliable than complex multi-hop architectures that create operational burden and potential failure points.
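To illustrate the least-privilege point: direct connectivity doesn't mean broad cluster access. Fleet manages the agents' API keys itself, but a sketch like the one below shows the principle of an append-only role; the URL, credentials, and index patterns are placeholders, not a recommendation for how Fleet actually provisions access.

```python
# Illustration of least privilege: a role that can only append documents to
# specific data streams. This is to show the principle, not Fleet's own mechanism.
# URL, credentials, role name, and index patterns are placeholders.
import requests

ES_URL = "https://elasticsearch.internal:9200"
ADMIN_AUTH = ("elastic", "changeme")  # placeholder admin credentials

role = {
    "indices": [
        {
            "names": ["logs-*", "metrics-*"],
            "privileges": ["auto_configure", "create_doc"],  # append-only, no reads or deletes
        }
    ]
}

resp = requests.put(f"{ES_URL}/_security/role/domain_a_ingest_only",
                    json=role, auth=ADMIN_AUTH,
                    verify="/etc/elastic/ca.crt", timeout=30)
resp.raise_for_status()
print("role created:", resp.json())
```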
When Segmentation Makes Sense
We're not saying network segmentation is always wrong. There are legitimate scenarios where you need Fleet Servers in separate network zones:
Air-gapped environments where there's literally no connectivity to central infrastructure
Compliance requirements that mandate specific network isolation
Geographically distributed environments where latency to a central cluster is prohibitive
Multi-tenant scenarios where you're providing Elastic Security as a service to separate customers
But in these cases, you need to architect properly:
Separate Fleet Server infrastructure from data processing
Properly load balance Fleet Servers with appropriate tooling
Ensure sufficient resources for control plane responsiveness
Have robust monitoring and troubleshooting procedures
Accept the operational complexity as a necessary trade-off
The Real Question to Ask
Before you implement a segmented Elastic architecture, ask yourself:
What security problem are we actually solving with this segmentation?
If the answer is "we don't want agents connecting directly to our Elasticsearch cluster," dig deeper. Why not? What's the specific threat you're mitigating? Can you mitigate it in a simpler way?
If the answer is genuinely about compliance, air gaps, or true security requirements, then accept the complexity and architect properly. But if it's just "segmentation seems more secure," you might be creating problems that outweigh the benefits.
Lessons Learned
If there's one takeaway from this architectural journey, it's this: complexity is a security risk.
Every additional component is another thing to configure, maintain, monitor, and troubleshoot. Every additional hop in your data flow is another point of potential failure and latency.
Sometimes that complexity is necessary. But often, simpler architectures that are properly secured are more reliable, more performant, and ultimately more secure than complex ones that look good in diagrams but create operational nightmares.
What does your Elastic architecture look like? Are you fighting complexity that doesn't actually improve security?
Designing scalable, performant Elastic Security deployments requires understanding both the platform capabilities and the real-world operational implications of architectural decisions. At Perceptive Security, we help organizations architect Elastic deployments that balance security, performance, and operational sustainability. Let's discuss your deployment architecture and ensure you're set up for success.


