This post is the first in a series about logging and audit trails from a security perspective.
At Latacora, we bootstrap security practices. We partner with companies that frequently have minimally developed security programs, work with them to figure out the right security practices for their current size, and then help them evolve and scale those practices as their business matures.
One thing we always ask new clients about is logging. Logs are the Swiss Army Knife of security tools. They can help us identify threats in real time, they can be used to investigate incidents as they happen, and they can be used to remediate incidents after the fact.
But to use logs as a security tool, your logging systems need to be built with security in mind. Fortunately, we’re here to share the fundamental principles of secure logging and auditing systems. In this series, we’ll share the same advice that we share with our clients to help them build scalable, secure-by-design systems from the ground up.
What do we want out of our logs?
When folks ask us about logging from a security perspective, their questions usually fall into one of two categories:
- How can we use logs to improve our security posture?
- How should we manage our logs to reduce our security exposure?
Before we talk about the “how”, we need to square away the “what”: what are the key requirements of a logging system from a security perspective (as opposed to observability for reliability engineering or debugging)?
When responding to security incidents, there are some common types of questions we want to be able to answer:
- “What is the full set of changes that Alice[1] made between 2021-04-29 and 2022-01-29?”
- “What is the full set of records that Trudy accessed between 2:30am and 5:30pm yesterday?”
- “How do we reverse all changes made by Bob between 2023-09-21 and 2023-10-05?”
- “Who accessed Zara’s customer data between 2016-03-02 and 2017-06-22?”
We can abstract these questions into a set of invariants for our system. The exact requirements will need to be tailored to your use case, but it might look something like:
- We can enumerate the list of database records that any user accessed or modified during an arbitrary window, going back up to three years (at a minimum).
- We can reverse any changes made by a user during an arbitrary window, going back up to six months (at a minimum).
- For any customer, we can enumerate all users who accessed their data for any window within the last 12 months, and we can do the same with daily resolution for any window within the last 2 years.
- (etc.)
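To make invariants like these concrete, it can help to write them down as queries. The following is a minimal sketch using Python's built-in sqlite3 module. The `audit_events` table, its columns, and the helper names are hypothetical, chosen only for illustration; a real schema will depend on your data model and storage.

```python
import sqlite3

# Hypothetical schema: one row per access or modification of a record.
SCHEMA = """
CREATE TABLE audit_events (
    id          INTEGER PRIMARY KEY,
    occurred_at TEXT NOT NULL,  -- ISO 8601 UTC timestamp, sorts lexicographically
    actor       TEXT NOT NULL,  -- user who performed the action
    action      TEXT NOT NULL,  -- 'read' or 'write'
    record_id   TEXT NOT NULL,  -- database record that was touched
    customer    TEXT NOT NULL   -- customer who owns the record
)
"""

def open_audit_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn

def record_event(conn, occurred_at, actor, action, record_id, customer):
    conn.execute(
        "INSERT INTO audit_events (occurred_at, actor, action, record_id, customer) "
        "VALUES (?, ?, ?, ?, ?)",
        (occurred_at, actor, action, record_id, customer),
    )

def records_touched_by(conn, actor, start, end):
    """Records a user accessed or modified in the window [start, end)."""
    rows = conn.execute(
        "SELECT DISTINCT record_id FROM audit_events "
        "WHERE actor = ? AND occurred_at >= ? AND occurred_at < ? "
        "ORDER BY record_id",
        (actor, start, end),
    )
    return [row[0] for row in rows]

def users_who_accessed(conn, customer, start, end):
    """Users who touched a customer's data in the window [start, end)."""
    rows = conn.execute(
        "SELECT DISTINCT actor FROM audit_events "
        "WHERE customer = ? AND occurred_at >= ? AND occurred_at < ? "
        "ORDER BY actor",
        (customer, start, end),
    )
    return [row[0] for row in rows]

conn = open_audit_db()
record_event(conn, "2023-09-21T10:00:00Z", "bob", "write", "rec-1", "zara")
record_event(conn, "2023-09-25T10:00:00Z", "bob", "read", "rec-2", "zara")
record_event(conn, "2023-11-01T10:00:00Z", "bob", "read", "rec-3", "zara")
record_event(conn, "2023-09-22T10:00:00Z", "alice", "read", "rec-1", "zara")
```

If an invariant cannot be expressed as a query over the data you actually record, that is a useful early signal that something is missing from what you capture.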
This is an ambitious set of invariants - that’s a good thing! At this early stage in the process, you want to err on the side of the aspirational. As you start planning the implementation, you’ll identify which invariants are achievable immediately, and which ones have to be slotted into a longer-term roadmap. Some invariants might be relaxed for the short term. For example, it might only be feasible to reverse changes from the last week in the short term, with support for up to six months of reversals as a longer-term goal. Or, for some classes of data, you might decide that it’s only necessary to record the change action, not the actual changes (i.e., the old and new values).
When building a logging system from the ground up, the goal isn’t to build the perfect system right away. Instead, you’re defining the end state that you want and taking incremental steps towards that end state. By defining the ideal system in terms of these invariants, we can reframe the broad question of building a secure system into a series of unambiguous declarations that will provide the basis for the security roadmap.
In addition to the things we want from our logs, there are some other things we don’t want out of our logs. We can create some additional invariants:
- Someone reading our application logs should not be able to take over any user’s account, either directly (password/secret access), or using support-based account-recovery options.
- If our application logs are ever compromised, we should not have any special reporting obligations for protected classes of data.
As with the earlier invariants, we’re framing these generically. Controls around developer access are good but insufficient if an attacker manages to exfiltrate a copy of the application logs, so we will also need controls to prevent certain types of data from entering the logs in the first place.
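One way to keep some data out of the logs at the source is a scrubbing step in the logging pipeline itself. The sketch below uses Python's standard logging module; the list of sensitive keys and the `key=value` message convention are assumptions made for illustration, and pattern-based scrubbing like this is a backstop rather than a substitute for not logging secrets at all.

```python
import logging
import re

# Hypothetical list of field names that must never reach the logs.
SENSITIVE_KEYS = ("password", "token", "secret", "reset_code")
_PATTERN = re.compile(
    r"(?i)\b(" + "|".join(re.escape(key) for key in SENSITIVE_KEYS) + r")=\S+"
)

def scrub(message: str) -> str:
    """Replace key=value pairs for sensitive keys with key=[REDACTED]."""
    return _PATTERN.sub(lambda match: match.group(1) + "=[REDACTED]", message)

class RedactingFilter(logging.Filter):
    """Scrub each record before the handler formats and ships it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = scrub(record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True

# Attaching the filter to a handler applies it to every record that
# handler emits, whichever logger the record came from.
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
```

Because the filter runs before the record is formatted, the unredacted value is never written to the handler's destination.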
These invariants are security-specific and mostly focused on the “auditing and incident response” side of logs. While security encompasses more than incident response, incidents are a helpful narrative for approaching security architecture. And of course, logs are useful for other purposes as well, such as observability and debugging. We’re focusing on security, so we’re not going to go into too much detail on how to improve our logs for those use cases, but as you apply this to your system, you can also write invariants that will help you achieve other goals, such as reliability.
What even are logs, anyway?
Now that we have our goals, we’re almost ready to start designing our logging system. But before we do, we should define our terminology, because the word “logs” is very overloaded! Let’s disambiguate:
Audit trails are a high-level security goal. The invariants that we came up with in the previous section are the defining parameters for our systems’ audit trails.
Server logs, request logs, and application logs are strategies that may be used to achieve that goal, in conjunction with other tools. However, server and application logs are mostly used for different purposes: observability, debugging, etc. - and these use cases don’t always overlap with auditing.
It’s okay to use server/application logs and piggyback off of them for auditing purposes, but you still need to think of them as separate-but-overlapping systems, especially as your company grows and your needs expand. Eventually, companies need to develop systems for audit trails that are independent of their logging systems, because the requirements for their audit trails have outgrown logs as an implementation strategy.
From here on out, we’ll use the term “audit trails” to refer to the set of invariants that you have decided on for your system. We’ll use the terms “server logs”, “request logs”, and “application logs” interchangeably, and we’ll distinguish them from “audit trails”.
Comparison chart
| | Logs | Audit Trails |
|---|---|---|
| Level of Detail | Very detailed, in-depth messages that capture the full content, e.g. “POST /login” plus the entire request context, headers, POST data, cookies, etc. | May have less detail, but are more specific (“Alice logged in at 1:42am from 1.2.3.4”) |
| Instrumentation | Often captured automatically via the framework/stack that you’re using (routing middleware, platform, etc.) | Explicitly created by developers and inserted in key spots in the code, e.g. audit.capture_event("user.logged_in", request.user, request.remote_ip) |
| When are they emitted? | Whenever is useful for developers; whenever is necessary for observability/reliability | Anytime users interact with the system in a way that could have security implications |
| When are they read? | Real-time or recent history (e.g. “how much traffic are we seeing to /login right now?”) or to debug things that happened recently (“why did so many attempts fail yesterday?”) | Used to trace what happened days/weeks/months later (“what IP address did Alice log in from in November 2022?”) |
| Retention | Shorter timeframe (e.g. 1 month, plus an additional backup in cold storage) | Years, or indefinitely |
| Who can access them? | Developers, and typically only developers | Exposed through staff admin panels, and sometimes to users themselves |
| Storage | Logging platform (ELK, Splunk, etc.) | Often stored in a database, but care must be taken to ensure non-repudiability |
| Access Speed | Fast (used in daily development), though more complicated queries may take longer | Can be slower. However, while some data may be moved to cold storage, audit logs cannot be 100% in cold storage. |
| Accuracy and validity | Can be used as a shortcut in incident response/remediation, but is not necessarily canonical. Care should still be taken to avoid log injection/forging. | Treated as canonical. |
Designing a system for logs and audit trails
Your system needs to fit the maturity level and requirements for your company. You don’t want something overengineered and impractical early on. At the same time, you need a system that’s flexible enough to become as robust as you will eventually need it to be.
Let’s look at two archetypes. The first is what clients usually have when they begin working with us, and it’s a good fit for smaller companies still getting their security practices up and running. The second is a very robust and idealized system that manages audit trails independently of logs, and it’s a good fit for more mature companies that need a sophisticated approach to auditing.
In practice, most clients that we work with have something in between these two. Archetype #1 is a convenient starting point for most companies to bootstrap their audit trails, whereas Archetype #2 is a goal that companies can work towards incrementally as their security practices mature.
Archetype 1: “Audit Trails Are Just Fancy Logs”
In this model, you use a logging library that provides custom log levels and supports writing to multiple destinations. The library will write all log events to stderr. Audit-like actions are distinguished with a designated logging level, and the process which slurps the log stream tees audit-level logs to a different downstream aggregator dedicated to audit-like actions.
[Code listing and its sample output are not reproduced here.]
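As an illustration of this model, here is a minimal sketch using Python's standard logging module. It is not the original listing: the `AUDIT` level number, the logger name, and the `build_logger` helper are assumptions, and the tee to the audit destination happens in-process through a second handler, standing in for the downstream process that would normally split the stream. The sample call deliberately logs an email address.

```python
import io
import logging
import sys

# Custom level between INFO (20) and WARNING (30). The number and the
# name are assumptions for this sketch.
AUDIT = 25
logging.addLevelName(AUDIT, "AUDIT")

class AuditOnly(logging.Filter):
    """Pass only records emitted at the AUDIT level."""

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno == AUDIT

def build_logger(audit_stream, log_stream=sys.stderr) -> logging.Logger:
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False

    fmt = logging.Formatter("%(levelname)s %(name)s %(message)s")

    # Destination 1: every log event, audit or not.
    everything = logging.StreamHandler(log_stream)
    everything.setFormatter(fmt)
    logger.addHandler(everything)

    # Destination 2: audit-level events only (stand-in for the aggregator
    # dedicated to audit-like actions).
    audit_only = logging.StreamHandler(audit_stream)
    audit_only.setFormatter(fmt)
    audit_only.addFilter(AuditOnly())
    logger.addHandler(audit_only)
    return logger

audit_sink = io.StringIO()  # in-memory stand-in for the audit aggregator
logger = build_logger(audit_sink)
logger.info("GET /health")
logger.log(AUDIT, "user.logged_in user=%s ip=%s", "alice@example.com", "1.2.3.4")
```

With this setup, both events go to stderr, while only the `AUDIT` line reaches `audit_sink`.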
This approach is the easiest to set up - most engineering teams build something like this naturally for the observability benefits alone. But there are some obvious downsides to this approach, and some low-hanging fruit for improvement.
For starters, this example logs an email address directly. Email addresses are generally considered personally identifiable information (PII), which you typically want to keep out of your logs. Depending on how the log stack is set up and how access is controlled, this example may be acceptable for a small company, but it quickly becomes untenable as the company grows.
Using application-level logging as the basis for audit logs can leave you vulnerable to log injection attacks. In the simplest case, an attacker can generate fake log entries that are indistinguishable from legitimate logs, jeopardizing the integrity of your audit trails.
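To see how a forged entry can arise, consider the sketch below. The log line format and function names are hypothetical. The first formatter pastes attacker-controlled text straight into the line; the second serializes the event as JSON, which is one common mitigation because `json.dumps` escapes newlines and other control characters.

```python
import json

def format_naive(username: str) -> str:
    # Vulnerable: attacker-controlled text is pasted into the log line as-is.
    return f"AUDIT login_failed user={username}"

def format_structured(username: str) -> str:
    # json.dumps escapes newlines and other control characters, so each
    # event stays on exactly one line and fields cannot bleed into each other.
    return json.dumps({"level": "AUDIT", "event": "login_failed", "user": username})

# A "username" that smuggles in a second, fake audit entry.
malicious = "mallory\nAUDIT login_succeeded user=admin"

naive_lines = format_naive(malicious).splitlines()
structured_lines = format_structured(malicious).splitlines()
```

Here `naive_lines` contains two entries, the second of which looks like a legitimate successful login, while `structured_lines` contains a single entry whose `user` field holds the whole malicious string.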
Defending against log forgery is only half the battle. In addition to making sure that the logs emitted by your application are accurate, you also need to ensure that the audit trails stored by your system are nonrepudiable: once an audit entry has been emitted by an application in the logs, it must be stored without modification. It should be difficult for any entity to tamper with the audit trails by adding, modifying, or deleting entries - including preventing entries from being persisted correctly - and any tampering must be noisy and visible.
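One technique for making tampering visible is to chain entries together with hashes, so that each entry commits to the one before it. The sketch below is an illustration of that idea rather than a complete design; the entry layout and function names are assumptions. Note its limits: it does not detect removal of the most recent entries unless the latest hash is also recorded somewhere else, and someone able to rewrite the entire store can recompute the whole chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append(chain: list, event: dict) -> None:
    """Add an event whose hash covers both the event and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})

def verify(chain: list) -> bool:
    """Return False if any entry was modified, removed mid-chain, or reordered."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True

def build_example_chain() -> list:
    chain = []
    append(chain, {"actor": "alice", "action": "user.logged_in"})
    append(chain, {"actor": "bob", "action": "record.updated"})
    append(chain, {"actor": "alice", "action": "user.logged_out"})
    return chain
```

Editing an event in place or deleting an entry from the middle of the chain causes `verify` to return `False`.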
Archetype 2: “Audit Trails Are Completely Orthogonal to Logs”
At the other end of the scale, audit trails and logs are completely separate systems that have no awareness of each other at all. Application logs are implemented with any standard log stack (ELK, Splunk, etc.). They still use scrubbing and redaction tools (which we’ll cover later), but otherwise they are completely independent of the auditing systems.
Instead of hooking into logs to provide audit trails, audit trails are integrated with database access (and other state changes). This may be implemented within the database driver or ORM, so that a developer does not need to remember to invoke the audit system when necessary - instead, all database accesses handle auditing invisibly, behind the scenes.
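A rough sketch of the idea, using Python's built-in sqlite3 module: application code talks to a thin wrapper rather than to the database directly, and the wrapper records an audit event for every statement. The `AuditedConnection` class and the `audit_events` table are hypothetical, and a real integration would more likely live inside the driver or ORM. Only the statement text is recorded here, not the parameters, to avoid copying sensitive values into the audit store.

```python
import sqlite3
from datetime import datetime, timezone

class AuditedConnection:
    """Hypothetical wrapper: every statement run through it is audited."""

    def __init__(self, app_db: sqlite3.Connection, audit_db: sqlite3.Connection, actor: str):
        self._app_db = app_db
        self._audit_db = audit_db
        self._actor = actor
        audit_db.execute(
            "CREATE TABLE IF NOT EXISTS audit_events "
            "(occurred_at TEXT, actor TEXT, statement TEXT)"
        )

    def execute(self, sql: str, params: tuple = ()):
        # Record the audit event first, so a statement never runs unaudited.
        self._audit_db.execute(
            "INSERT INTO audit_events VALUES (?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), self._actor, sql),
        )
        self._audit_db.commit()
        return self._app_db.execute(sql, params)

# Separate in-memory databases stand in for separate database instances.
app_db = sqlite3.connect(":memory:")
audit_db = sqlite3.connect(":memory:")
app_db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

db = AuditedConnection(app_db, audit_db, actor="alice")
db.execute("INSERT INTO customers (name) VALUES (?)", ("Zara",))
rows = db.execute("SELECT name FROM customers").fetchall()
```

The calling code never mentions auditing, yet both the insert and the read end up in `audit_events`.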
Audit trails are mostly used after the fact, and this informs several of the design guidelines in the comparison chart. Audit trails can be relatively slow to retrieve or query - particularly for older events - unlike application logs, which are often used for real-time observability needs. Application logs are typically exposed through dedicated logging stacks (e.g. ELK), because general-purpose databases would be inefficient. Audit trails can be stored in traditional application databases, because read-time performance is less of a concern.
For operational simplicity, you may want to store audit trails in the same type of database that’s used for general data storage (e.g. MySQL). If you choose this route, it’s a good idea to separate the database instances used for your audit logs from the ones used for general user data. In the event of a breach, you need to be confident that, even if your application database is compromised, you are still able to trust the integrity of the data in your audit trails.
From an implementation standpoint, it’s common to use event streaming platforms to optimize performance. If so, check the delivery guarantees of the platform you are using: many provide “at-most-once” semantics by default - a valid choice for some use cases, but audit logs require “at-least-once” delivery. In addition, pay close attention to any possible ways that the non-repudiable property of your audit trails can be violated. Who has access to events after they have been generated? After an event has been emitted, can anyone prevent it from being recorded? Questions like these will help you understand the limitations of your auditing system. Audit trails don’t always have to be perfect - what’s most important is making sure the classes of possible errors are well-understood and appropriate for your use case.
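At-least-once delivery means the same event may arrive more than once, so the receiving side usually needs to deduplicate. The sketch below simulates this with in-memory stand-ins; `FlakyTransport`, `AuditSink`, and `emit_at_least_once` are hypothetical names and do not correspond to any particular streaming platform's API.

```python
import uuid

class AuditSink:
    """Consumer that deduplicates on event ID, so redelivery is harmless."""

    def __init__(self):
        self.events = {}
        self.deliveries = 0

    def receive(self, event: dict) -> None:
        self.deliveries += 1
        self.events.setdefault(event["id"], event)  # keep the first copy only

class FlakyTransport:
    """Stand-in for an event stream that sometimes loses the acknowledgement."""

    def __init__(self, sink: AuditSink, fail_first_n_acks: int):
        self._sink = sink
        self._failures_left = fail_first_n_acks

    def send(self, event: dict) -> bool:
        self._sink.receive(event)       # the event is delivered...
        if self._failures_left > 0:
            self._failures_left -= 1
            return False                # ...but the ack is lost
        return True

def emit_at_least_once(transport, payload: dict, max_attempts: int = 5) -> str:
    # The ID is assigned once, before any retry, so duplicates share it.
    event = {"id": str(uuid.uuid4()), **payload}
    for _ in range(max_attempts):
        if transport.send(event):
            return event["id"]
    # Fail loudly rather than silently dropping an audit event.
    raise RuntimeError("audit event was not acknowledged")

sink = AuditSink()
transport = FlakyTransport(sink, fail_first_n_acks=2)
event_id = emit_at_least_once(transport, {"actor": "bob", "action": "record.updated"})
```

In this run the sink receives three deliveries but stores exactly one event. The failure case matters just as much: if no acknowledgement ever arrives, the emitter raises instead of continuing as if the event had been recorded.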
This archetype may sound familiar to people used to codebases that implement centralized authorization or object-based authorization. Audit trails can be integrated with the authorization logic, and doing so makes it easier to guarantee that every authorization check or protected object is properly auditable.
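As a small illustration of coupling auditing to authorization, the sketch below wraps protected functions in a decorator that performs the permission check and records the decision in the same place. The `authorize` decorator, the shape of the `user` dictionary, and the in-memory `AUDIT_TRAIL` list are all assumptions made for this example.

```python
import functools

AUDIT_TRAIL = []  # stand-in for a real audit store

class PermissionDenied(Exception):
    pass

def authorize(permission: str):
    """Check a permission and record the decision, whether allowed or denied."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user: dict, *args, **kwargs):
            allowed = permission in user["permissions"]
            AUDIT_TRAIL.append(
                {"actor": user["name"], "permission": permission, "allowed": allowed}
            )
            if not allowed:
                raise PermissionDenied(permission)
            return func(user, *args, **kwargs)
        return wrapper
    return decorator

@authorize("customer.read")
def read_customer(user: dict, customer_id: str) -> str:
    return f"customer record {customer_id}"

alice = {"name": "alice", "permissions": {"customer.read"}}
trudy = {"name": "trudy", "permissions": set()}

result = read_customer(alice, "zara")
try:
    read_customer(trudy, "zara")
    denied = False
except PermissionDenied:
    denied = True
```

Because the check and the audit record are produced by the same code path, a protected function cannot be called without leaving an entry, and denied attempts are captured as well as successful ones.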
This archetype has clear advantages: it is very robust, and - when implemented correctly with appropriate tooling - is straightforward and non-invasive for developer workflows. However, it takes a lot of work to get the system to that point, and that’s an investment that most engineering teams can’t budget for all at once. For most teams, it’s better instead to think of this archetype as a desirable end state, and to work incrementally on steps that will help bring the codebase towards this model.
What’s Next
We’ve covered a lot of ground already, so let’s recap:
We began by defining a set of invariants. These statements specify the security guarantees we want to ensure for our system, both in the short term and in the long term. Together, the invariants form the parameters for our audit trails, which are a high-level security goal.
From an implementation standpoint, we’ve seen two different archetypes for implementing audit trails. One relies on application logs, and the other uses a more complex system for recording audit events. In practice, your system will likely be something between the two, involving elements of both. Understanding how both archetypes work will help you make architectural decisions and plan work that will improve your overall security posture incrementally, by moving you towards the more robust archetype.
In the next article in this series, we’ll go over the mechanics of implementing audit trails, using either of these two archetypes. We’ll talk in more detail about what data belongs in your audit trails and logs - and just as importantly, what data doesn’t belong in your audit trails or logs. We’ll also cover managing your audit trails and logs: how to ensure that they are non-repudiable, how to handle access control, how to deal with alerting, and how these topics intersect with security information and event management (SIEM).
[1] If you’re familiar with cryptographic placeholder names, you may notice something with these examples: we don’t distinguish between malicious users and non-malicious users. That’s intentional: it’s impossible to know, a priori, whether a user’s actions are legitimate or the result of a compromised account.