This post is the second in a series about logging and audit trails from a security perspective.
If you’re looking to level up your security practices, logging is a good place to focus your attention. Just as logging is a core pillar of observability, comprehensive audit trails are a core pillar of a strong security program. Logs and audit trails are separate but overlapping concepts, and most companies can improve their security posture by investing in this area.
In the previous article in this series, we established that audit trails are a high-level security goal. We specify the requirements for our audit trails by defining a set of invariants that we want to maintain for our systems. The implementation can happen in a variety of ways, and we discussed two possible archetypes for implementation, with the understanding that most companies actually exist at a transitional stage somewhere between the two extremes.
The first archetype relies wholly on existing logging infrastructure. In this model, audit trails are just fancy logs: developers produce audit trails by logging data, and security teams use developer observability tools to view and work with the audit trails. We call this the Just Fancy Logs archetype.
Archetype #2 treats audit trails as a completely separate concept from logs. Developers produce audit trails through a separate audit trail interface in application code, which stores data in a dedicated storage mechanism, separate from application logs. Security teams working with the audit trail data use dedicated tooling, which is distinct from the tooling that developers use for daily observability needs. We call this the Independent Audit Events archetype.
The Just Fancy Logs archetype has some obvious benefits. It’s so easy to implement that companies often build something like it organically, without consciously thinking about it. Repurposing existing infrastructure also cuts down costs, both monetary costs and the costs of complexity.
But it also has some drawbacks compared to the Independent Audit Events archetype. Printing highly sensitive and personally identifying information in logs that are visible to developers might be convenient when reconstructing a chain of actions during an incident investigation, but it also becomes a liability from a privacy standpoint, making it difficult to ensure data isn’t being misused. Depending on the data and the jurisdiction, giving broad and unchecked access to this data may also run afoul of data protection regulations. And it’s simply not robust. The consistency and durability models of developer log stacks are designed for ease of operation with minimal impact on running services, which is good for reliability engineering, but not ideal for security. It’s acceptable, if unfortunate, if a failed log rotation causes a brief loss in observability data with no user-facing impact. It’s a lot less acceptable if a failed log rotation means your team can’t determine whether an attacker accessed critical resources during a later incident investigation.
These two archetypes are two ends of a spectrum, and for these reasons, companies typically start off close to the Just Fancy Logs archetype, but find that they need the benefits of a system like the Independent Audit Events archetype as they grow. The challenge is that it can be difficult to reach the Independent Audit Events archetype through incremental changes, because it’s not just a matter of beefing up the existing logs - it’s migrating to a different model altogether.
Companies generally don’t start preparing for this transition until they already urgently need it, at which point they try to fill the gap quickly by reaching for commercial software. As we’ll see in the next section, commercial security tools can be fine if integrated appropriately, but if companies aren’t prepared to adapt their development process accordingly, they won’t be able to get the full value from these third-party tools. They will then find themselves stuck in the valley of an intermediate stage, frustrated with a system that has the worst of both worlds: the lack of visibility and the insufficient security guarantees of Just Fancy Logs and the excess complexity and expense of Independent Audit Events.
In this article, we’ll present the framework that we use to help clients bolster their security practices by improving their logs and audit trails, moving from Just Fancy Logs towards Independent Audit Events. Independent Audit Events is often accomplished using third-party software called a SIEM, but we’ll also show you how you can implement this model entirely in-house, if you would prefer not to rely on external vendors. Navigating this transitional area can seem challenging, but having the right plan in place makes it a lot easier.
The security space is full of third-party vendor software, and one tool that vendors are often extremely excited to sell you on is Security Information and Event Management (SIEM) software. At its core, a SIEM is an immutable and canonical data store that aggregates logs across multiple sources. That’s already a really valuable tool, even before you configure automated alerting and threat detection on top of it (which is the part vendors usually focus on in their marketing materials).
In our experience, developers who work as application or infrastructure engineers are often extremely wary of third-party commercial software, especially when it is clearly aimed at enterprise users (which most SIEMs are), and especially when that software appears to duplicate functionality that they’re already getting from lower-cost or open-source software that they can run in-house.
At first glance, SIEMs seem shaped like general log storage systems. After all, observability data is collected through observability events, logs are just one type of observability event, and “aggregating, correlating, and alerting” on data sounds like what our logging stack is designed to do. And that’s right - but that’s not the whole story.
As we discussed in the last post, we want different things out of our audit trails. The properties that define good audit trails are different from the properties that define good developer logs. In some cases, such as durability and access controls, they are in direct conflict with one another. Even though there’s some overlap, it’d be impractical - and expensive - to overload our observability tools for security.
Beyond expense and ease of use, an even more fundamental issue is the security model commonly used by observability platforms. By design, developers typically have broad access to observability platforms. That access is an asset for software development, but it’s a liability for security analysis. For security purposes, audit trails need to be both nonrepudiable (difficult to forge) and extremely durable (difficult to excise). They also need to contain granular records of sensitive transactions that should only be visible to select personnel on a strictly as-needed basis. Can you wrangle developer observability tools to provide all of these properties? Maybe. Will you get frustrated trying to jam a square peg into a round hole and watching your bills spiral out of control in the process? Almost certainly yes.
In order to be useful at scale, a SIEM needs to be optimized for a very different set of workflows, and within a different set of security parameters, compared to your developer logging stack. Much as you wouldn’t use git as a replacement for ZFS, or use rsync as a replacement for Dropbox, you probably don’t want to shoehorn the use cases for a SIEM into your existing tooling. You can’t optimize a tool for all use cases simultaneously, and as we’ll see soon, the requirements for your SIEM are often going to be in direct opposition to the requirements of your developer log stack. That means that you’re going to want a dedicated SIEM, whether that means using a commercial tool or just standing up a separate, independent instance of your logging stack with a completely different configuration, optimized for serving as a SIEM instead of as a developer observability tool.
While there’s no single right answer to the question of how to stand up a SIEM, there are a couple of different paths that we’ve seen clients take successfully.
The first is to go through a standard vendor acquisition process. This usually means dealing with an enterprise sales process. That can be fine if you have the time, patience, and money for it, but this series is about bootstrapping logging security practices, which means enterprise tools are not always the best fit.
The second is to run an independent instance of your existing logging stack. As an example, if your developers are currently using an ELK stack or Splunk for their developer logs, you create a completely separate ELK or Splunk cluster, configured to provide the appropriate properties for a SIEM - long retention, restricted access, nonrepudiation, etc. Creating a parallel cluster is pretty simple if you’re already using Infrastructure-as-Code to manage your systems (you are using IaC, right?). For many folks, this second approach will be good enough. It adds little operational complexity, and it’s relatively inexpensive.
But we’ve found that some companies don’t have the time or money for enterprise sales processes, or the engineering bandwidth to operate a separate SIEM - even if that SIEM is just an independent instance of the same logging stack they are already using. To help get over the initial hump, we’ve taken to offering a SIEM service to all of our clients. Latacora SIEM, built on top of Panther, gives companies access to an enterprise-level SIEM, regardless of their size or budget.
What we’ve found by putting a SIEM in the hands of companies that would normally be too small to run one is that having a separate tool for audit trail-shaped events triggers a paradigm shift and provides a built-in forcing function, driving companies towards more mature security practices.
Distinguishing application logs from infrastructure logs when integrating with a SIEM is a key paradigm shift in moving along the spectrum from Just Fancy Logs to Independent Audit Events. Many companies we work with begin treating audit trails as a separate concept from developer-focused logs by integrating with a SIEM, even if they initially bootstrap their audit trails from developer-focused logs.
To start, we recommend skipping application logs entirely and focusing on ingesting infrastructure logs - i.e., logs from corporate security, IT, and cloud providers. You’ll get to the application logs eventually, but for now, the top priority should be logs from your identity provider (IdP), cloud provider (e.g. IaaS), IT platform, and version control system.
There are two reasons we recommend this. First, your application logs are already readily accessible in whatever observability tools you use for daily operations. Having them duplicated doesn’t give you as much additional benefit right now as the infrastructure logs will. You can always cross-reference the SIEM data with your application logs if needed. In contrast, if you don’t have a SIEM tool, your infrastructure logs are probably siloed. They’re difficult to query, and it’s even more difficult to correlate events across multiple sources. This will make it frustrating to investigate incidents involving a suspected employee account compromise or a malicious insider1.
But there’s a more important reason that we recommend this distinction. As we mentioned before, application logs are fundamentally different from audit trails. There is some inherent overlap, and application logs are one possible tool for implementing audit trails, but they represent different layers of abstraction. By starting with infrastructure logs, you are requiring that developers who need to create an audit trail event from application-level data use another approach, besides “just write a log line”.
In the purest form of the Just Fancy Logs archetype, audit trails are emitted using the logging library, using a designated logging level. To enable your organization to move in the direction of the Independent Audit Events archetype, the first step is to change the way developers interface with audit trails, and to stop overloading your logging library for this purpose.
Create a separate library for developers to interface with audit trails, even if that library is small, exporting only one or two new methods. You can implement this library as a wrapper around your logs. It’s fine if “audit.event(…)” is just syntactic sugar for “log.audit(…)”. The way people interact with a tool shapes how they use it. If developers have to interact with audit trails in a different way, they will begin to think of audit trails differently from how they think of logs, and they will begin to use them differently, even if the data all ends up in the same place (for now).
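As a minimal sketch of what that wrapper might look like - assuming a Python service that uses the standard logging module; the module name, the custom AUDIT level, and the field names are illustrative, not a reference to any particular library:

```python
# audit.py - a thin wrapper so that developers call audit.event() rather than
# logging directly. The AUDIT level and field names are illustrative.
import logging

AUDIT_LEVEL = 35  # sits between WARNING (30) and ERROR (40)
logging.addLevelName(AUDIT_LEVEL, "AUDIT")
_logger = logging.getLogger("audit")

def event(action: str, actor_id: str, **context) -> None:
    """Record an audit trail event.

    For now this is just syntactic sugar over the logging stack, but because
    callers never touch the logger directly, the backend can later be swapped
    for a dedicated audit store without changing any call sites.
    """
    _logger.log(AUDIT_LEVEL, "audit_event", extra={
        "audit_action": action,
        "audit_actor_id": actor_id,
        "audit_context": context,
    })
```

A call site then uses “audit.event(…)” rather than the logging library directly - the important part is the distinct entry point, not the particular implementation behind it.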
From this point on, there is a clear contract: if any data is essential for auditing, it must be emitted in an explicit audit trail, independent from developer-focused application logs. The application logs are still available as a fallback, but they are no longer considered the first-class entity for security and auditing purposes.
Next, you will set up your SIEM to receive these audit trail events from your logs. Eventually, you will outgrow this implementation of your audit library, and you’ll look at cutting the logs out of the process altogether with other ways to record that data, such as a dedicated database or event streaming platform. This is the transitional phase between Just Fancy Logs and Independent Audit Events, and it’s an area where there’s no one-size-fits-all answer, so we work with our clients to figure out the appropriate incremental steps to take.
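One nice property of the wrapper approach is that the later migration can be invisible to developers. As a hedged sketch - Kinesis is used here purely as an illustration of “some durable stream that your SIEM ingests”, and the stream name is hypothetical - a later iteration of the same “audit.event(…)” interface might write to a stream instead of (or in addition to) the application logs:

```python
# audit.py, a later iteration - same interface, different backend. Kinesis and
# the stream name are illustrative placeholders for whatever durable transport
# your SIEM actually ingests from.
import datetime
import json

import boto3

_kinesis = boto3.client("kinesis")
_STREAM_NAME = "audit-trail-events"  # hypothetical stream name

def event(action: str, actor_id: str, **context) -> None:
    record = {
        "action": action,
        "actor_id": actor_id,
        "occurred_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "context": context,
    }
    _kinesis.put_record(
        StreamName=_STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=actor_id,
    )
```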
Infrastructure logs are a good starting point: you probably already have them, and they cover a lot of important ground without any additional work. But business logic is mostly reflected at the application layer, so at some point, you’ll want to integrate events and data from your application into your audit trails as well.
Your infrastructure logs are automatically generated for you from the platforms you’re using, but much like your application-level logs, your application-level audit events need to be built manually.
As a baseline, you’ll want to start by auditing the following pieces of data in your application:
Some of these may already be covered by your infrastructure logs, but there probably are gaps. For example, infrastructure logs will cover authorization grants with external services, but unless your entire authorization system was integrated deeply with AWS Cognito or Google Identity Platform from the start, you likely have implicit authorizations happening within your application that aren’t visible in infrastructure logs.
If you’re unsure whether a given event or action should be captured in an audit trail, ask yourself these questions:
If the answer to any of these questions is “yes”, then ask yourself one more question: “Is there another existing audit trail that would provide all necessary information under all circumstances?” That final clause is key: duplicate trails can add up to a noticeable expense over time, but it’s far worse if an attacker can slip through the cracks between your audit trails. If there’s no other existing audit trail that is a complete duplicate, then you’ve identified a new audit trail event to add.
Knowing which events to record isn’t enough: you need to make sure the data is sufficiently hydrated as well, with all core metadata attached to the event. At a minimum, every event should be easily traced back to:
These pieces of information are essential to capture for all recorded events: both application logs and audit trails.
These four pieces of information belong with every event in an audit trail, but each event will have additional contextual information as well. Audit trails should be emitted every time a user (including non-human users) interacts with the system in a way that could have security implications. Some examples are obvious, like changes to account information (email addresses, passwords, bank account information, etc.), but some are application-specific. Because there are no hard and fast rules here, we typically work with clients to identify the actions that need to be auditable.
When creating request logs, one common pattern is to generate an opaque, unique request ID in a middleware early in the request lifecycle and emit a log line that contains all of this information up-front. Other log lines deeper in the call tree for the same request can include this same request ID instead of repeating all of that information in every line. This pattern, sometimes referred to as a canonical log line, is a powerful and versatile observability tool. (Note that, despite the name, canonical log lines are only canonical from an observability perspective - e.g. for an application developer debugging production requests. From a security perspective, canonical log lines are not authoritative unless additional care has been taken to guarantee nonrepudiation - we’ll talk about that shortly.)
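To make the pattern concrete, here is a minimal sketch assuming a Flask application - the framework choice and field names are illustrative:

```python
# Sketch of the canonical-log-line pattern: one structured line per request,
# emitted by middleware. Deeper log lines only need to carry request_id.
import logging
import time
import uuid

from flask import Flask, g, request

app = Flask(__name__)
logger = logging.getLogger("canonical")

@app.before_request
def assign_request_id():
    g.request_id = str(uuid.uuid4())
    g.start_time = time.monotonic()

@app.after_request
def emit_canonical_line(response):
    logger.info("canonical_line", extra={
        "request_id": g.request_id,
        "method": request.method,
        "path": request.path,
        "status": response.status_code,
        "remote_addr": request.remote_addr,
        "duration_ms": round((time.monotonic() - g.start_time) * 1000, 1),
    })
    return response
```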
This approach aligns well with the Just Fancy Logs archetype, which makes it a good place to start if you’re just beginning on your audit trail journey. As an added bonus, if you’re not already using canonical log lines as a developer tool, you’ll be improving observability for your developers at the same time.
From a security perspective, a big advantage of canonical log lines is that developers never get in the habit of explicitly logging user IDs, session IDs, or IP addresses in application logs, because they can count on that information being retrievable through the middleware-managed request log line. While these pieces of data are not always sensitive by themselves, they often sit beside sensitive user data that doesn’t belong in logs. The more developers have to manage this manually, the more likely it is that sensitive data will end up in logs by mistake.
If you are following the Independent Audit Events archetype, you can still use canonical log lines as an observability tool, but you will need to store this information in your auditing datastore as well. For audit trails, however, we recommend storing these four pieces of information with each entry, rather than attempting to deduplicate. This will make it easier to query for information later on, and it will also make your audit trails more tamper-resistant.
When audit trails are stored in a dedicated datastore, audit events can include full change information, either as a diff or as a full archive of the object before mutation. Storing the change information is helpful during incident remediation - even if an attacker is able to compromise the application, as long as the audit trails are intact, they can be used to fully recover the original application state.
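As a sketch of what such an entry might look like - the exact field set here is an assumption for illustration, combining identifying metadata (who, which session, from where, when) with the before/after state of the mutated object:

```python
# Sketch of a rich audit event as it might be stored in a dedicated datastore.
# The field names are illustrative assumptions, not a prescribed schema.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional

@dataclass(frozen=True)
class AuditEvent:
    action: str                               # e.g. "user.email_changed"
    actor_id: str                             # who performed the action
    session_id: str                           # which session/credential was used
    source_ip: str                            # where the request came from
    occurred_at: datetime                     # when it happened
    resource: str                             # what was acted upon
    before: Optional[dict[str, Any]] = None   # object state prior to mutation
    after: Optional[dict[str, Any]] = None    # object state after mutation

event = AuditEvent(
    action="user.email_changed",
    actor_id="user:1234",
    session_id="sess:a1b2c3",
    source_ip="203.0.113.7",
    occurred_at=datetime.now(timezone.utc),
    resource="user:1234",
    before={"email": "old@example.com"},
    after={"email": "new@example.com"},
)
```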
But when using logs to bootstrap your audit trails, as in the Just Fancy Logs archetype, you generally can’t include the full change information, because it’s important to limit the amount of personally identifiable information (PII) in log files. There are different tradeoffs to make, depending on your application. For example, sometimes it’s sufficient just to log field names for edit operations (i.e., without recording the field values). In some cases, you can either tokenize information or hash it before logging it, although that requires some care to avoid chosen-plaintext attacks. Choosing the right approach for a given application is another area where there isn’t a one-size-fits-all answer, and where we often work with companies to determine the best approach for their needs.
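To illustrate the care required: a plain, unkeyed hash lets anyone who can guess a candidate value (an email address, a card number) hash it themselves and check for a match. A keyed hash avoids that particular failure mode. A minimal sketch, assuming the key lives in a secrets manager outside the logging pipeline (the environment variable name is a placeholder):

```python
# Sketch of keyed tokenization for values you want to correlate across log
# lines without revealing them. The key source is a placeholder; in practice it
# belongs in a secrets manager and must never appear in the logs themselves.
import hashlib
import hmac
import os

_LOG_TOKEN_KEY = os.environ["LOG_TOKEN_KEY"].encode()  # hypothetical env var

def tokenize_for_logs(value: str) -> str:
    """Return a stable keyed token for a sensitive value. Without the key, an
    attacker cannot confirm a guessed plaintext, unlike a bare SHA-256 digest."""
    return hmac.new(_LOG_TOKEN_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
```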
Having detailed data in audit trails and logs is good, but it is possible to have too much of a good thing. Some data shouldn’t be stored in certain places.
Let’s start with logs. Whether you’re using logs to provide your audit trails (as in the Just Fancy Logs archetype) or not, certain information should never go in application logs. For example:
It may be acceptable to write this information in audit trails that are stored in a separate, restricted-access data store, but these should never touch your application logs. Application logs are too broadly accessible, increasing the likelihood of them getting leaked, and making it functionally impossible to audit access after-the-fact. If you find yourself needing to write these types of data to your logs, that’s either a sign of gaps in your observability tools, or a sign that you’re ready to start investing in independent audit trail infrastructure, as discussed earlier.
A common rule of thumb is that “personally-identifiable information (PII) should not be written to logs”. While helpful, that’s unfortunately not the whole story. The term “PII” has different definitions depending on the context - by the strictest and most literal interpretation, that would mean that logs couldn’t contain IP addresses and user IDs!
Because this is inherently contextual, instead of issuing one-size-fits-all guidelines, we work with our clients to identify the types of data that should be considered sensitive for their application and the restrictions that would be appropriate for that data. As general advice, we recommend categorizing the types of data that your application stores into tiers - for example: “sensitive”, “restricted”, and “unrestricted”.
Oftentimes, when sensitive data is inadvertently written to logs, it’s not because a developer logged a field or value that they didn’t know was sensitive - it’s because a developer added a field or value to another structure, without realizing that structure would be logged. For example, a developer adds a new field to a query, not realizing that the query is included in a GET request that is later logged. Or a new field is added to an object, and that object happens to be logged at a different place in the code. Code review can prevent some of this, but because that’s not enough by itself, we also recommend implementing a scrubbing layer.
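One way to implement a scrubbing layer, sketched below, is a logging filter attached to your handlers that redacts known-sensitive field names before anything is written. The field list is a placeholder - in practice it should be derived from the data classification tiers described above, and a production scrubber would also need to inspect the formatted message itself:

```python
# Sketch of a scrubbing layer as a logging filter. SENSITIVE_FIELDS is a
# placeholder; derive it from your own data classification. This version only
# scrubs dict-valued attributes attached via extra={...}; treat it as defense
# in depth, not a guarantee.
import logging

SENSITIVE_FIELDS = {"password", "ssn", "card_number", "auth_token"}

class ScrubbingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        for attr, value in list(record.__dict__.items()):
            if isinstance(value, dict):
                record.__dict__[attr] = {
                    key: ("[REDACTED]" if key.lower() in SENSITIVE_FIELDS else val)
                    for key, val in value.items()
                }
        return True  # never drop the record, only scrub it

handler = logging.StreamHandler()
handler.addFilter(ScrubbingFilter())  # handler-level filters see every record routed to this handler
logging.getLogger().addHandler(handler)
```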
If you’re using independent infrastructure for audit trails, as you typically are with the Independent Audit Events archetype, the question of what data you can safely record in audit trails gets a bit more complicated. On the one hand, recording rich audit events every time a resource is mutated enables sophisticated guarantees, such as being able to roll back arbitrary sets of changes, in case a breach is discovered. On the other hand, the more places that sensitive data is stored, the greater the risk of inadvertent exposure, even if access is tightly controlled.
Ultimately, this is a tradeoff. Figuring out the right way to balance that tradeoff requires threat modeling, which is inherently case-specific. That’s why the very first step we took in designing a system of secure audit trails was to define the invariants that parameterize our system. By starting with the invariants, we keep our decisions grounded in the context of the organization’s broader goals.
This strategy isn’t just limited to logging systems and audit trails - it’s the secret to designing secure systems in any context. Security architecture engineering is, at its core, the practice of making informed technical decisions that balance the risks of different threats in a way that is appropriate for the broader goals of the product or organization.
Regardless of your implementation choice, you need a way to discover security incidents rapidly and respond to them efficiently. Let’s assume that you’re looking to run and manage your entire system in-house, rather than springing for Managed Detection and Response (MDR), in which a third party software-assisted service handles some of this work for you. We’ll also assume that you’re investing in this with the goal of improving your company’s security posture, rather than with compliance goals in mind. (Compliance is an important goal in its own right, but compliance is not security - they are distinct topics). And finally, let’s assume you’ve stood up a SIEM, where you’re aggregating an initial set of logs for which you plan to build alerting.
Alert fatigue is a first-class problem with all alerting systems. Overwhelming your staff with alerts that are unactionable or red herrings will quickly train people to ignore them altogether, defeating the point of the SIEM in the first place. The STAT framework3 can be used to reduce alert fatigue by making sure that all alerts are Supported (the alert is clearly owned by a specific responsible team or individual), Trustworthy (owners have confidence that the false-positive and false-negative rates are appropriate in context), Actionable (the responsible party can take immediate action with no additional analysis or decision-making), and Triaged (grouped with tiers that reflect the urgency of the alert).
If we apply the STAT framework to the alerts created from a SIEM, it highlights another reason that we separate a SIEM from our developer-focused logs: the usage patterns around alerting are different. Application logs are an important observability tool, so it’s common to have alerts set up on logs in order to monitor service-level objectives and other operational needs. But those alerts aren’t always ideal for detecting security-related incidents. Not only will the responsibility of responding to those alerts fall on different teams, but the volume of alerts, the expected mean-time-to-response, and the acceptable rates of precision and recall (false positives and false negatives) will differ as well.
When configuring your alerts for the first time, it can be tempting to err on the side of caution and alert on everything that could be suspicious, “just in case”, but we recommend a more tempered approach. As a rule of thumb, you should aim for 0-1 alerts per policy per week that require attention. This is very different from how developer-oriented and reliability-oriented alerting often works in practice. Many SREs would love a rotation where they’re only paged once per week, to say nothing of non-paging alerts that they need to take action on!
The reason for this difference is that SIEMs and audit trail events are created to solve a fundamentally different problem. Especially for small companies, a SIEM is not a tool that you’re investing in because you expect to be interacting with it regularly. It’s a tool that you’re investing in because it has to be set up before it’s needed in order to provide any value once it actually is needed. Proactively alerting you to a breach as soon as it happens is definitely one way a SIEM can be useful, but in practice, much of the value you get out of a SIEM is from the investigations that it enables you to run after-the-fact. As an illustration, Panther has a tier of alerts which are created in a closed state, purely to enrich future investigations. If you’re used to approaching security-focused logging and audit trails from a developer-driven perspective, it takes some time to get used to this paradigm!
Most people will not be interacting with your SIEM regularly. At a small company, you might end up with only a single person responsible for it, who only needs to interact with it sporadically. At the same time, on the occasions when it is needed, it’s likely to be a top priority. At Latacora, we’re big fans of tabletop exercises as a tool for highlighting gaps in policy, procedure, detection, data collection, or institutional knowledge. We regularly help our clients run tabletop exercises to test incident response to various simulated threats. Once you have set up your SIEM and are collecting key data in it, we strongly recommend doing a practice run with a mock incident investigation, to verify that the information you believe your SIEM holds is actually stored and queryable. Frequently, people find that critical data is missing, either due to a broken step in the pipeline or due to incorrect retention policies. That’s something you want to discover now, so that you can fix it immediately, as opposed to discovering it during the investigation for an actual incident.
It’s time to talk about a fundamental property of secure audit trails: nonrepudiation.
In a way, nonrepudiation is the inverse of durability. Durability is the question of whether valid events will be reliably persisted, whereas nonrepudiation is the question of whether the events that have been persisted are valid.
For our audit trails to be useful to a security team, we must be able to state with extremely high confidence that any event retrieved from our records is authentic and has not been tampered with. If the audit trails show that Bob logged in on December 12th at 6:05AM, then that should mean that Bob (or someone who has hijacked Bob’s account) logged in at that time.
Nonrepudiation can be violated in several ways, the simplest of which is a log forging attack, such as the following (a hypothetical Python example):
```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s", level=logging.INFO)

# username comes straight from user input; an attacker submits a value
# containing a newline followed by a fabricated log line:
username = "bob\nINFO Successful login for user: admin"
logging.info("Failed login for user: %s", username)
```
This will produce legitimate-looking, forged log lines:
```
INFO Failed login for user: bob
INFO Successful login for user: admin
```
In this simple example, accepting arbitrary user input provides malicious users an opportunity to generate fake log entries that are completely indistinguishable from legitimate ones. In extreme cases, log forging can even lead to log injection attacks, in which an attacker can achieve remote code execution (RCE) through forged logs.
Much as using a well-designed SQL client library will reduce the risk of writing queries that are vulnerable to SQL injection, a well-designed log library will reduce the risk of writing logs that are vulnerable to log forging attacks. The risk is not zero, however, even when using widely-used and well-regarded libraries. Ultimately, the Just Fancy Logs archetype carries some inherent risk of forged logs. This risk can be reduced, but you’ll hit diminishing returns at some point, and by that point you should probably be thinking of migrating towards some of the more sophisticated setups described earlier, such as an event-streaming platform, which are more aligned with the Independent Audit Events archetype. While not immune to forging attacks either, the barrier to executing a successful attack with those setups is much higher.
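One common mitigation - shown below as a hedged sketch, not tied to any particular library - is to emit structured (JSON) output in which user-controlled values are fields rather than raw message text. An embedded newline is then escaped inside a quoted string instead of starting a new, legitimate-looking record:

```python
# Sketch of structured JSON logging as a log forging mitigation. A newline in
# user input is serialized as "\n" inside the JSON string, so it cannot start a
# fake record on its own line. This reduces, but does not eliminate, the risk.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "username": getattr(record, "username", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)

logger.info("failed login", extra={"username": "bob\nINFO Successful login for user: admin"})
# -> {"level": "INFO", "message": "failed login", "username": "bob\nINFO Successful login for user: admin"}
```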
Another common way that nonrepudiation can be violated - with either archetype - is if attackers are able to directly tamper with the data store. To address this, you’ll need to implement strong access controls around each step of your audit trail pipeline.
Access control for audit trails can seem like a paradox. On the one hand, audit trails contain some of the most sensitive information in your systems, so you want to lock them down as much as possible. On the other hand, if audit trails are generated before every action that accesses or updates important state, then most services will need direct or indirect access. How do we square this circle?
The implementation details will be specific to your system, but keeping these three principles in mind will guide you towards sensible and secure design decisions:
Whether you’re using the Just Fancy Logs archetype, the Independent Audit Events archetype, or - more likely - something in between, you can implement these principles for secure access control.
Under the Just Fancy Logs archetype, applications inherently can write directly to audit trails, so you will need some defense against log poisoning (which will likely include templating layers and data validation, similar to defenses against SQL injection). But those applications won’t have the ability to read or modify audit trails after they have been emitted.
Under the Independent Audit Events archetype, applications can only write audit trails through an intermediate layer, by streaming events which are then ingested by the SIEM. You can use this intermediate layer as a trust boundary: it makes it difficult for an attacker to tamper with audit trails, and it can add metadata labeling events with their origin and source, or enrich them with threat intelligence data such as IP geolocation or identities associated with known IP ranges.
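As a sketch of what that trust boundary might look like - all of the names here (enrich, geolocate, the field layout) are hypothetical - the ingestion layer, which the application cannot reach, stamps provenance and threat-intelligence metadata onto each event before it is forwarded to the SIEM:

```python
# Sketch of an enrichment step at the ingestion trust boundary. The application
# can only append raw events; this layer adds provenance and threat-intel
# metadata that the application (and an attacker inside it) cannot alter.
# All names here are hypothetical.
import datetime

def geolocate(ip: str) -> str:
    return "unknown"  # placeholder for a lookup against a geolocation dataset

def enrich(raw_event: dict, source: str, received_from_ip: str) -> dict:
    return {
        **raw_event,
        "ingest": {
            "source": source,  # which pipeline, account, or service emitted the event
            "received_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "received_from_ip": received_from_ip,
            "geo": geolocate(received_from_ip),
        },
    }
```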
In this two-part series, we’ve laid out the framework that Latacora uses to help companies establish security practices from scratch and then scale those into mature security programs.
Even within the context of audit trails and security logging, there’s a lot that we do that we didn’t get to cover. Having a framework is one thing, but applying it within the context and particulars of an existing organization can be tricky. Much of what we do follows from this framework, but actual on-the-ground implementations vary widely, driven by each company’s unique needs.
On top of that, building a system is only part of the story: operating it on an ongoing basis requires continued work with its own challenges. Areas where companies often run into issues and seek guidance include data scrubbing (preventing forbidden data from entering logs), redaction (dealing with the situation where forbidden data has entered logs), and incident response (actually using your audit trails to successfully investigate and remediate an incident). If you’re interested in hearing more about these topics, please reach out; we’d love to hear from you. We help our customers work on these problems all the time, so we’re always happy to share our thoughts on these topics.
When building security practices for a growing company, it’s important to remember that everything is always a work in progress, and you’re simultaneously designing for the solutions that you need today as well as what you will need tomorrow. The tools and infrastructure you need at various stages in your company’s growth are all very different - not just in terms of scale: the shape and organization of problems (and the ways to address them) are different. That means there isn’t going to be a linear path from one to the next. That said, knowing what the beginning and end points need to look like, and what invariants need to be held at all points along the way, will help you make good decisions in the intermediate stages.
As we discussed in the previous article in this series, an employee account compromise and a malicious insider are a priori indistinguishable at the beginning of an investigation. ↩︎
Audit trails are often used to expand the scope of an incident by looking for trails of an attacker, but a robust audit trail system can also be used to reduce the scope of an incident, which can be just as valuable. For example, if all access to a given service is gated by a bastion service that produces immutable and durable audit trails, then a lack of authorization grants can be evidence that an attacker was not able to access that service. Drawing outer bounds around the incident can be just as useful as establishing the inner bounds. ↩︎
For more information on the STAT framework, see the talk “Warning: This Talk Contains Content Known to the State of CA to Reduce Alert Fatigue”. ↩︎