Latacora collects and analyzes data about services our clients use. You may have read about our approach to building security tooling, but the tl;dr is we make requests to all the (configuration metadata) read-only APIs available to us and store the results in S3. We leverage the data to understand our clients' infrastructure and identify security issues and misconfigurations. We retain the files (“snapshots”) to support future IR/forensics efforts.
This approach has served us well, but the limited scope of a snapshot meant there was always a problem of first needing to figure out which files to look at. We love aws s3 sync and grep as much as anyone, but security analysis requires looking for complex relationships between resources; text search is, at best, only a Bloom filter. What we really wanted was a performant way to ask any question across all the data we have for a client, one that would support complex queries using logic programming.
Improving the connectivity of our data improves the signal-to-noise ratio: a vulnerability in one area can be contextualized with information from another. Is MFA disabled in Okta for the person who can assume roles with AdministratorAccess in AWS? Is code in that GitHub repo with a compromised supply chain running in production?
Latacora has deep roots in the Clojure community. We’re a frequent conference sponsor and provide financial support for sustainable open source work in the form of Clojurists Together. All this to say, we’ve been following Datomic and databases like it for years. In fact, several of us were in attendance at Clojure/Conj when Datomic was announced as free and we eagerly consumed the recent Jepsen report.
Datomic is a database that stores information as immutable atomic facts, and its indices support many access patterns that may be familiar to you from relational, graph, key/value, and columnar databases. Datomic implements queries using Datalog, a logic programming language with many attractive features like implicit joins and recursive evaluation supporting graph traversal. For many people, once you’ve learned to query with Datalog, using anything else is slightly unpleasant. We already believed Datomic would be a nice place to end up, but getting there meant working through some hard questions.
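Before we get to those questions, here’s a small taste of why Datalog is pleasant: a hedged sketch (the attribute names are hypothetical, not from our schema) of a recursive rule that walks role-assumption edges to any depth to find everyone who can reach an admin role.
(def assume-rules
  '[;; base case: a principal directly assumes a role
    [(can-assume ?principal ?role)
     [?principal :iam/assumes ?role]]
    ;; recursive case: a principal assumes something that can assume the role
    [(can-assume ?principal ?role)
     [?principal :iam/assumes ?hop]
     (can-assume ?hop ?role)]])

;; the query itself; pass assume-rules as the % input to d/q
'[:find [?principal ...]
  :in $ %
  :where
  [?role :iam/attached-policy "AdministratorAccess"]
  (can-assume ?principal ?role)]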
Our collection processes run on a scheduled basis and collect everything we can about a service from scratch every time. This usually means the resulting files are both large and full of redundant information overlapping other recent collections. Napkin math showed we’d run into exorbitant prices if we tried to accumulate data that way anywhere other than S3. To make storage viable we needed to either give up on the approach or figure out how to store only one copy of each unique piece of information.
Another problem: our tooling is so dynamic and so dependent on the services we analyze that at no point could a human reasonably sit down and write out all the attributes we might see. We’re not even sure all major cloud providers can do that for their own services. Even if we could, doing so would be antithetical to our data collection philosophy and would mean imposing static constraints on an otherwise open world of data we don’t control. Yet Datomic is rigorous in a healthy “eat your vegetables” way, so to maintain our open world assumption we’d need to dynamically infer a schema.
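To make “infer a schema” concrete, here’s a rough sketch of the idea (not the library’s actual implementation): look at each key’s value and emit a Datomic attribute definition based on its shape, so maps and collections become component references and scalars become plain attributes.
;; rough sketch of schema inference; the real version handles more value
;; types, indexing decisions, and conflicting observations
(defn infer-attribute
  [k v]
  (cond
    (map? v)        {:db/ident       k
                     :db/valueType   :db.type/ref
                     :db/isComponent true
                     :db/cardinality :db.cardinality/one}
    (sequential? v) {:db/ident       k
                     :db/valueType   :db.type/ref
                     :db/isComponent true
                     :db/cardinality :db.cardinality/many}
    (string? v)     {:db/ident       k
                     :db/valueType   :db.type/string
                     :db/index       true
                     :db/cardinality :db.cardinality/one}
    (boolean? v)    {:db/ident       k
                     :db/valueType   :db.type/boolean
                     :db/cardinality :db.cardinality/one}
    (integer? v)    {:db/ident       k
                     :db/valueType   :db.type/long
                     :db/cardinality :db.cardinality/one}
    :else           {:db/ident       k
                     :db/valueType   :db.type/string
                     :db/cardinality :db.cardinality/one}))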
Most of Latacora’s tools are batch-oriented: independent tools participating in one or more loosely coupled analysis and reporting pipelines. Even though our data is always stored in S3, we needed to define a standard file format that would work everywhere and help us achieve a uniform query interface.
We saw a need for a library oriented around writing and reading snapshots that would let us focus more on intent and less on mechanism. We wanted the library to do the heavy lifting so existing tools could quickly pick it up and start participating in our vision of more connected data.
Having worked with files containing this data for years, we already knew file size would be a problem if we didn’t support streaming from the start. We also suspected we’d need to perform chunked transactions to avoid swamping the transactor with giant transactions. With those constraints in mind we considered the interaction patterns: some tools write snapshots, others read them, and both need to do so incrementally.
Acknowledging these roles brought clarity to the design. We’d need separate append-only files for schema, metadata, and collection data. Together those three files could represent a snapshot in a way that is friendly to both streaming readers and writers. We could confidently say schema and metadata files would always be relatively small while still allowing collection data files to grow very large.
We opted to generate a UUIDv7 for every snapshot as the identity and use the embedded timestamp to derive a yyyy/mm/dd/ prefix for S3. To load data into Datomic we’d transact the schema file followed by the collection data file in reasonably sized chunks. The snapshot metadata would be added to every transaction to facilitate data provenance and transaction grouping.
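Here’s a minimal sketch of the prefix derivation (the helper name is ours, for illustration): a UUIDv7 stores a 48-bit Unix-millisecond timestamp in its high bits, so the S3 prefix is a bit shift and a date format away.
(import '(java.time Instant ZoneOffset)
        '(java.time.format DateTimeFormatter))

(defn snapshot-prefix
  "Derive a yyyy/mm/dd/ S3 prefix from the timestamp embedded in a UUIDv7."
  [^java.util.UUID snapshot-id]
  ;; the top 48 bits of a UUIDv7 are a Unix timestamp in milliseconds
  (let [millis (unsigned-bit-shift-right (.getMostSignificantBits snapshot-id) 16)]
    (-> (Instant/ofEpochMilli millis)
        (.atZone ZoneOffset/UTC)
        (.format (DateTimeFormatter/ofPattern "yyyy/MM/dd/")))))

(snapshot-prefix #uuid "0191e251-c4a1-87ba-967f-84bab9c4abff")
;; => "2024/09/11/"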
To deal with duplicate data we borrowed ideas from content-addressable storage and began computing a hash for every entity to use as an identity. This probably sounds strange at first: don’t resources in our customers' environments constantly change?
The key observation is that what we actually record in our snapshots is simply data about API requests and responses. As we convert data into datoms, every single map in the original data becomes its own entity. Parent maps turn into entities with reference attributes pointing to their child map entities. There is no direct correspondence between our database entities and what you might consider to be the resources (or entities) of the services we collect data about.
Our entity hash is therefore simply a reliable way to assign an identity to a map of data we’ve seen before. The hash is computed as an unordered combination of hashes from each key/value pair associated with the entity. To achieve unification (deduplication) we mark the attribute containing the hash as a :db.unique/identity attribute in the schema and trust Datomic to correctly wrangle all the pointers during a transaction.
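The exact digest doesn’t matter much here, but the shape of the computation does. A rough sketch (not our production code): hash each key/value pair, recursing into child maps so nested entities contribute their own identities, and combine the pair hashes with an order-insensitive reduction.
;; rough sketch of an order-insensitive entity hash: map key ordering never
;; affects the result because the per-pair hashes are combined with addition
(defn entity-hash
  [m]
  (reduce unchecked-add 0
          (map (fn [[k v]]
                 (hash [k (if (map? v) (entity-hash v) v)]))
               m)))

(= (entity-hash {:a 1 :b 2})
   (entity-hash {:b 2 :a 1}))
;; => true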
Given our collection approach we expected most of the maps to stay the same over time. Empirically, we now know only 5% of them “change” on a given day. With such a low rate of data accumulation, it became both affordable and realistic to transact every single snapshot into a long-running Datomic database for each customer. This also means every entity added by a snapshot contains truly novel information about our customers' environments.
Datomic records the transactions as entities themselves so the entire history remains queryable. If that sounds powerful, that’s because it is. Leveraging data this way has enabled us to start creating things like networking diagram diffs or heatmaps showing how customer environments evolve over time in just a few hours of coding.
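For example, because the snapshot metadata is asserted on the transaction entity itself (you’ll see the “datomic.tx” tempid in the metadata file below), listing every ingested snapshot and when it landed is an ordinary query, along these lines (assuming datomic.api is required as d and db is a current database value):
;; every ingested snapshot id, paired with its transaction's wall-clock time
(d/q '[:find ?snapshot-id ?when
       :where
       [?tx :snapshot/metadata ?meta]
       [?meta :snapshot/id ?snapshot-id]
       [?tx :db/txInstant ?when]]
     db)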
We can still answer point-in-time questions despite the unorthodox immutable entity sharing. When we want time-bounded queries, we apply a db filter that traverses references from the datoms under consideration to decide whether a shared entity (shared because of hashing) is in or out of scope. This works because every snapshot has a root node, so if an entity was part of a snapshot’s observations there will always be a path from the root to that entity.
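Sketching the shape of that filter (a simplification: here we precompute the set of in-scope entities by pulling the root’s component tree, whereas the real predicate walks references as datoms are considered and is smarter about caching):
(defn component-eids
  "All entity ids reachable from eid through component references."
  [db eid]
  (->> (d/pull db '[*] eid)            ; wildcard pull recurses into components
       (tree-seq coll? seq)
       (keep #(when (and (map-entry? %) (= :db/id (key %))) (val %)))
       set))

(defn snapshot-scoped-db
  "A filtered db value that only sees datoms belonging to entities reachable
   from the given snapshot's root."
  [db snapshot-id]
  (let [root      (d/q '[:find ?root .
                         :in $ ?id
                         :where
                         [?tx :snapshot/metadata ?meta]
                         [?meta :snapshot/id ?id]
                         [?root :snapshot/roots _ ?tx]]
                       db snapshot-id)
        in-scope  (component-eids db root)
        user-part (d/entid db :db.part/user)]
    (d/filter db (fn [_ datom]
                   (or (contains? in-scope (:e datom))
                       ;; keep schema and transaction datoms visible, assuming
                       ;; collected entities land in :db.part/user
                       (not= user-part (d/part (:e datom))))))))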
Assigning hash identities to every entity creates idempotency. Transacting the same snapshot twice quite literally does nothing at all. This means we can safely implement retries and automated recovery processes without worrying about famously elusive “exactly once” behavior.
Elegance begets elegance, so it was no surprise when we noticed our design mirrors Clojure’s approach to immutable data structures with structural sharing. If you squint, a snapshot is just a tree described as an append-only log whose nodes might unify onto some existing nodes in the Datomic graph. Each snapshot minimally introduces a new root node to acknowledge observations as of a point in time, but everything else could very well be pointers to things we already knew.
It’s instructive to see how the data we write is transformed into triples before being appended to a snapshot. The snapshot APIs are exposed as “reducibles” (reifications of clojure.lang.IReduceInit) because they play nicely with transducers and offer clean incremental APIs even when IO resources are involved. Here we’re doing everything in memory for demo purposes, but the same APIs are available backed by files or network sockets.
You’ll see we compute and inject :content/hash attributes for every entity when we convert the structured data to datoms. We then reuse the hash for the temporary IDs because it simplifies bookkeeping later when partitioning the datoms across multiple transactions.
(require '[com.latacora.snapshot.core :as snap])
; building a snapshot is very easy
(def rw (snap/memory-reader-writer))
(def original-data {:aws {:s3 {:ListBuckets {:response {:Buckets [{:Name "❤️Clojure"}]}}}}})
; prefix-keys exists to help us generate unique attribute idents from structured data
(def prefixed-data (snap/prefix-keys original-data))
; =>
#_{:aws
{:aws/s3
{:aws.s3/ListBuckets
{:aws.s3.ListBuckets/response
{:aws.s3.ListBuckets.response/Buckets
[{:aws.s3.ListBuckets.response.Buckets/Name "❤️Clojure"}]}}}}}
(snap/write-data-record! rw prefixed-data)
; after you've created or acquired a snapshot, you can inspect the contents
(into [] (snap/schema-reducible rw))
; =>
#_[["$TEMP$:snapshot/metadata" :db/ident :snapshot/metadata]
["$TEMP$:snapshot/metadata" :db/valueType :db.type/ref]
["$TEMP$:snapshot/metadata" :db/isComponent true]
["$TEMP$:snapshot/metadata" :db/cardinality :db.cardinality/one]
["$TEMP$:snapshot/roots" :db/ident :snapshot/roots]
["$TEMP$:snapshot/roots" :db/isComponent true]
["$TEMP$:snapshot/roots" :db/valueType :db.type/ref]
["$TEMP$:snapshot/roots" :db/cardinality :db.cardinality/many]
["$TEMP$:aws" :db/ident :aws]
["$TEMP$:aws" :db/isComponent true]
["$TEMP$:aws" :db/valueType :db.type/ref]
["$TEMP$:aws" :db/cardinality :db.cardinality/one]
["$TEMP$:aws.s3.ListBuckets.response/Buckets" :db/ident :aws.s3.ListBuckets.response/Buckets]
["$TEMP$:aws.s3.ListBuckets.response/Buckets" :db/isComponent true]
["$TEMP$:aws.s3.ListBuckets.response/Buckets" :db/valueType :db.type/ref]
["$TEMP$:aws.s3.ListBuckets.response/Buckets" :db/cardinality :db.cardinality/many]
["$TEMP$:aws.s3/ListBuckets" :db/ident :aws.s3/ListBuckets]
["$TEMP$:aws.s3/ListBuckets" :db/isComponent true]
["$TEMP$:aws.s3/ListBuckets" :db/valueType :db.type/ref]
["$TEMP$:aws.s3/ListBuckets" :db/cardinality :db.cardinality/one]
["$TEMP$:aws.s3.ListBuckets.response.Buckets/Name" :db/ident :aws.s3.ListBuckets.response.Buckets/Name]
["$TEMP$:aws.s3.ListBuckets.response.Buckets/Name" :db/index true]
["$TEMP$:aws.s3.ListBuckets.response.Buckets/Name" :db/valueType :db.type/string]
["$TEMP$:aws.s3.ListBuckets.response.Buckets/Name" :db/cardinality :db.cardinality/one]
["$TEMP$:snapshot/id" :db/ident :snapshot/id]
["$TEMP$:snapshot/id" :db/unique :db.unique/identity]
["$TEMP$:snapshot/id" :db/valueType :db.type/uuid]
["$TEMP$:snapshot/id" :db/cardinality :db.cardinality/one]
["$TEMP$:aws.s3.ListBuckets/response" :db/ident :aws.s3.ListBuckets/response]
["$TEMP$:aws.s3.ListBuckets/response" :db/isComponent true]
["$TEMP$:aws.s3.ListBuckets/response" :db/valueType :db.type/ref]
["$TEMP$:aws.s3.ListBuckets/response" :db/cardinality :db.cardinality/one]
["$TEMP$:aws/s3" :db/ident :aws/s3]
["$TEMP$:aws/s3" :db/isComponent true]
["$TEMP$:aws/s3" :db/valueType :db.type/ref]
["$TEMP$:aws/s3" :db/cardinality :db.cardinality/one]
["$TEMP$:content/hash" :db/ident :content/hash]
["$TEMP$:content/hash" :db/valueType :db.type/string]
["$TEMP$:content/hash" :db/unique :db.unique/identity]
["$TEMP$:content/hash" :db/cardinality :db.cardinality/one]]
; metadata contains the snapshot id and possibly other data contextualizing the collection
; if the snapshot creator chose to add some (e.g. timestamps, permissions, region, account)
(into [] (snap/metadata-reducible rw))
; =>
#_[["datomic.tx" :snapshot/metadata "$TEMP$:0191e251-c4a1-87ba-967f-84bab9c4abff:meta"]
["$TEMP$:0191e251-c4a1-87ba-967f-84bab9c4abff:meta" :snapshot/id #uuid"0191e251-c4a1-87ba-967f-84bab9c4abff"]]
; finally, the collection data itself along with all the computed entity hashes
(into [] (snap/data-reducible rw))
; =>
#_[["$TEMP$:0191e251-c4a1-87ba-967f-84bab9c4abff:root" :content/hash "L1$Empl0SWCfM5VqxlGuoF13g=="]
["$TEMP$:0191e251-c4a1-87ba-967f-84bab9c4abff:root" :snapshot/roots "$TEMP$:L1$I3BP4J4RgqiSKAsx3Yfseg=="]
["$TEMP$:L1$I3BP4J4RgqiSKAsx3Yfseg==" :content/hash "L1$I3BP4J4RgqiSKAsx3Yfseg=="]
["$TEMP$:L1$I3BP4J4RgqiSKAsx3Yfseg==" :aws "$TEMP$:L1$QDqyNVa31W9mtE2dR/Ny8g=="]
["$TEMP$:L1$QDqyNVa31W9mtE2dR/Ny8g==" :content/hash "L1$QDqyNVa31W9mtE2dR/Ny8g=="]
["$TEMP$:L1$QDqyNVa31W9mtE2dR/Ny8g==" :aws/s3 "$TEMP$:L1$7GaqvuJbYJUuqpGgd0/D4A=="]
["$TEMP$:L1$7GaqvuJbYJUuqpGgd0/D4A==" :content/hash "L1$7GaqvuJbYJUuqpGgd0/D4A=="]
["$TEMP$:L1$7GaqvuJbYJUuqpGgd0/D4A==" :aws.s3/ListBuckets "$TEMP$:L1$Y6tkNE33m+5MpHGN/7O9dQ=="]
["$TEMP$:L1$Y6tkNE33m+5MpHGN/7O9dQ==" :content/hash "L1$Y6tkNE33m+5MpHGN/7O9dQ=="]
["$TEMP$:L1$Y6tkNE33m+5MpHGN/7O9dQ==" :aws.s3.ListBuckets/response "$TEMP$:L1$KFNi79w6xjD3sqHXzAsUCA=="]
["$TEMP$:L1$KFNi79w6xjD3sqHXzAsUCA==" :content/hash "L1$KFNi79w6xjD3sqHXzAsUCA=="]
["$TEMP$:L1$KFNi79w6xjD3sqHXzAsUCA==" :aws.s3.ListBuckets.response/Buckets "$TEMP$:L1$EIL6q4pKOxLq6VD0oJAkiw=="]
["$TEMP$:L1$EIL6q4pKOxLq6VD0oJAkiw==" :content/hash "L1$EIL6q4pKOxLq6VD0oJAkiw=="]
["$TEMP$:L1$EIL6q4pKOxLq6VD0oJAkiw==" :aws.s3.ListBuckets.response.Buckets/Name "❤️Clojure"]]
; if you've used datomic before it's probably clear how you would
; transact these pieces of data into a database. For us, it's just
; a function call away
(require '[datomic.api :as d])
(def db (snap/memory-db rw))
(d/q '[:find ?name .
:where
[_ :aws.s3.ListBuckets.response.Buckets/Name ?name]]
db)
; => "❤️Clojure"
Now that we know we can accumulate data from a bunch of tools into the same Datomic database and not worry about rapidly increasing storage costs, we can set our minds towards writing queries. Queries can start leveraging information from multiple tools to produce more specific, correct, and actionable findings. For fun, let’s look at a couple examples:
One of the things we love about the attribute namespacing approach is that anyone who knows even a little about the service APIs can begin to explore and write queries identifying items of interest or vulnerabilities.
This query finds all the public IPs associated with any network interfaces, and then it checks whether there’s a security group on the interface allowing access from the entire internet.
(d/q '[:find [?ip ...]
:where
[?nic-association :aws.ec2.describe-network-interfaces.response.network-interfaces.association/public-ip ?ip]
[?nic :aws.ec2.describe-network-interfaces.response.network-interfaces/association ?nic-association]
[?nic :aws.ec2.describe-network-interfaces.response.network-interfaces/groups ?nic-sg]
[?nic-sg :aws.ec2.describe-network-interfaces.response.network-interfaces.groups/group-id ?group-id]
[?sg :aws.ec2.describe-security-groups.response.security-groups/group-id ?group-id]
[?sg :aws.ec2.describe-security-groups.response.security-groups/ip-permissions ?sg-rule]
(or-join [?sg-rule]
(and [?sg-rule :aws.ec2.describe-security-groups.response.security-groups.ip-permissions/ipv4ranges ?ipv4-range]
[?ipv4-range :aws.ec2.describe-security-groups.response.security-groups.ip-permissions.ipv4ranges/cidr-ip "0.0.0.0/0"])
(and [?sg-rule :aws.ec2.describe-security-groups.response.security-groups.ip-permissions/ipv6ranges ?ipv6-range]
[?ipv6-range :aws.ec2.describe-security-groups.response.security-groups.ip-permissions.ipv6ranges/cidr-ipv6 "::/0"]))]
db)
The previous query probably looks verbose, but it’s unambiguous. We can write shared Datalog rules to encapsulate query fragments and improve readability. Here’s the same query factored into a couple of rules: one checking security group ingress and one discovering (or testing) whether an IP is internet accessible. You can imagine how we could make it even more accurate by consulting route tables and prefix lists.
(def rules
'[[(sg-allows-internet? ?group-id)
[?sg :aws.ec2.describe-security-groups.response.security-groups/group-id ?group-id]
[?sg :aws.ec2.describe-security-groups.response.security-groups/ip-permissions ?sg-rule]
(or-join [?sg-rule]
(and [?sg-rule :aws.ec2.describe-security-groups.response.security-groups.ip-permissions/ipv4ranges ?ipv4-range]
[?ipv4-range :aws.ec2.describe-security-groups.response.security-groups.ip-permissions.ipv4ranges/cidr-ip "0.0.0.0/0"])
(and [?sg-rule :aws.ec2.describe-security-groups.response.security-groups.ip-permissions/ipv6ranges ?ipv6-range]
[?ipv6-range :aws.ec2.describe-security-groups.response.security-groups.ip-permissions.ipv6ranges/cidr-ipv6 "::/0"]))]
[(internet-accessible? ?ip)
[?nic-association :aws.ec2.describe-network-interfaces.response.network-interfaces.association/public-ip ?ip]
[?nic :aws.ec2.describe-network-interfaces.response.network-interfaces/association ?nic-association]
[?nic :aws.ec2.describe-network-interfaces.response.network-interfaces/groups ?nic-sg]
[?nic-sg :aws.ec2.describe-network-interfaces.response.network-interfaces.groups/group-id ?group-id]
(sg-allows-internet? ?group-id)]])
(d/q '[:find [?ip ...]
:in $ %
:where
(internet-accessible? ?ip)]
db
rules)
We’ll spare you the ~50 lines of code for this one, but suffice it to say it finds all the AWS identifiers that exist anywhere in the graph and builds a custom Gource changelog showing every time any of those resources changed. The primary branches are AWS accounts, followed by regions, services, and resource types, with leaves representing individual resources.
We’ve already modified our most complex and important tools to output these snapshots and participate in the graph. We also defined infrastructure to launch a process whenever snapshots land in S3 and immediately ingest their contents into a long-running Datomic transactor. Answering arbitrary questions and searching for vulnerabilities is finally as easy as connecting a client and running queries, as it should be.
In the coming weeks we’ll update our detection and reporting pipelines to be expressed purely in terms of these Datalog queries against the unified data instead of analyzing different files for each tool. We expect the new division of responsibilities to allow us to iterate more efficiently on data collection, inferencing, analysis, and reporting mechanisms.
Interested in working with Latacora to apply these power tools to your environment? Say hello@latacora.com.