SIEM Use Case Implementation Mind Map
Monday, September 1, 2014
Building out an organization's security detection capability can be a daunting task. The complexity of the network, the number of applications/servers/clients, the sheer number of potential threats, and the unlimited attack avenues those threats can use are only a few of the challenges. There are different ways to tackle this task, and one approach is to leverage use cases. Use cases are "a logical, actionable and reportable component of an Event Management system." The event management system I kept in mind for this mind map is a SIEM, but it may apply to other types of systems. InfoSec Nirvana's post SIEM Use Cases – What you need to know? and Anton Chuvakin's post Detailed SIEM Use Case Example demonstrate how to build a use case and what it should entail. My previous post Linkz for SIEM links to a few more and this paper does as well. In this post I walk through how one can take a documented use case and translate it into something actionable to improve an organization's security detection capability.
The process to translate a use case into something actionable can be broken down into four distinct areas: Log Exploration, Custom Rules, Default Rules, and Detect Better Respond Faster. Each area has different activities to complete, but each has a minimum set of activities to accomplish. This process is illustrated in the mind map below:
Log Exploration
Identify Logs
The first activity is to take a detailed look at the use case and determine all of the log sources needed to detect the risk outlined in it. This may have been done when the use case was documented, but it is still a good activity to repeat to ensure all logs are identified. It involves looking at the risk, the path the risk takes through the network including applications and devices, and determining which devices/applications contain logs of interest. For example, if the use case is to detect web server attacks, then the path runs from the threat to the application itself. The devices the threat passes through may include routers, switches, firewalls, IDS systems, proxy servers, the web server, and the web application itself, all of which may contain logs of interest.
Identify Required Log Events
After the logs have been identified, the next activity is to identify which events in those logs are needed. A log can contain a range of events recording numerous items, but only some are specific to the use case at hand. This involves doing research on the device/application and possibly setting up a testing environment. For example, if the use case is to detect lateral movement using remote desktop, then the log source would be the Windows Security event log (which contains authentication events) and the events of interest are the event IDs specific to remote desktop usage.
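To make this concrete, below is a minimal Python sketch of that kind of event filtering, assuming the Windows Security events have already been exported and parsed into dictionaries elsewhere; the field names are illustrative rather than tied to any particular collector.

```python
# Minimal sketch: pick out remote desktop logon events from parsed
# Windows Security log records. Assumes events were exported and parsed
# elsewhere into dicts; field names are illustrative.

# Event ID 4624 is a successful logon; logon type 10 (RemoteInteractive)
# indicates a remote desktop session.
RDP_SUCCESS_ID = 4624
RDP_LOGON_TYPE = 10

def rdp_logons(events):
    """Yield events that record a successful remote desktop logon."""
    for event in events:
        if event.get("EventID") == RDP_SUCCESS_ID and event.get("LogonType") == RDP_LOGON_TYPE:
            yield event

sample_events = [
    {"EventID": 4624, "LogonType": 10, "TargetUserName": "admin", "IpAddress": "10.0.0.5"},
    {"EventID": 4624, "LogonType": 2,  "TargetUserName": "jdoe",  "IpAddress": "-"},
]

for hit in rdp_logons(sample_events):
    print(hit["TargetUserName"], hit["IpAddress"])
```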
Confirm Logging Configuration
The actual devices'/applications' logging configurations are reviewed to ensure they record the events needed for the use case. Keep in mind, turning on auditing or changing configurations impacts performance, so this needs to be tested prior to rolling it out production-wide. If performance is significantly impacted, then find an alternative method or a happy medium everyone is agreeable to.
Bring In The Logs
Now it is time to make the required configuration changes to bring the logs into the event management system. How this is done depends on the event management system and the source the logs are coming from. At times logs are pushed to the event management system, as with syslog, while at other times they are pulled into the event management system, as with Windows event logs collected through WMI.
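As a rough illustration of the push model, here is a minimal Python sketch of a syslog-style UDP collector. It is only a stand-in for whatever receiver the event management system actually provides; the port and output file are assumptions (a non-privileged port is used so the script can run without root).

```python
# Minimal sketch of the "push" model: a UDP listener that accepts
# syslog-style messages and appends them to a file for later collection.
import socket

def collect_syslog(bind_addr="0.0.0.0", port=5514, outfile="collected_syslog.log"):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    with open(outfile, "a", encoding="utf-8", errors="replace") as out:
        while True:
            data, (src_ip, _src_port) = sock.recvfrom(8192)
            # Tag each message with the sending host before storing it.
            out.write(f"{src_ip} {data.decode('utf-8', errors='replace')}\n")
            out.flush()

if __name__ == "__main__":
    collect_syslog()
```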
Custom Rules
Explore Logs
After the log(s) are flowing into the event management system, it's time to start exploring them. Look through the collected logs not only to see what is there and how the events are structured but also to see what in the log(s) can be used in a detection rule.
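One way to do this exploration outside the event management system's own console is sketched below; it assumes the collected events have been parsed into dictionaries, and the field names are illustrative.

```python
# Minimal sketch of log exploration: count how often each event type
# appears and which fields it carries, to spot candidates for detection rules.
from collections import Counter, defaultdict

def profile_events(events, type_field="EventID"):
    counts = Counter()
    fields = defaultdict(set)
    for event in events:
        event_type = event.get(type_field, "unknown")
        counts[event_type] += 1
        fields[event_type].update(event.keys())
    return counts, fields

events = [
    {"EventID": 4624, "LogonType": 10, "IpAddress": "10.0.0.5"},
    {"EventID": 4625, "LogonType": 10, "IpAddress": "10.0.0.9"},
    {"EventID": 4624, "LogonType": 2,  "IpAddress": "-"},
]

counts, fields = profile_events(events)
for event_type, count in counts.most_common():
    print(event_type, count, sorted(fields[event_type]))
```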
Create Custom Rules
Most event management systems come with default rules, and my guess is most people start with those. However, I think the better option is to first create custom rule(s) for the use case. The custom rule(s) can incorporate all of the research completed, information from discussions with others, and experience and indicators from previous responses to security incidents. The custom rule(s) are more tailored to the organization and have a greater chance of detecting the risk outlined in the use case than the default rules/signatures. Which custom rule to create is solely dependent on the use case. Make sure to leverage all information available and ensure the rule will hit on the items located in the events from the devices'/applications' log(s). After creation, the rule is implemented.
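Continuing the remote desktop example, below is a minimal sketch of what such a custom correlation rule could look like; the threshold, field names, and the assumption that events arrive in time order are all illustrative rather than prescriptive.

```python
# Minimal sketch of a custom correlation rule: alert when a source address
# racks up several failed RDP logons (4625, logon type 10) and then
# succeeds (4624, logon type 10). Threshold and fields are illustrative.
from collections import defaultdict

FAILURE_THRESHOLD = 5

def rdp_bruteforce_alerts(events):
    """Yield (source_ip, failure_count) when failures are followed by a success."""
    failures = defaultdict(int)
    for event in events:  # events assumed to be in time order
        if event.get("LogonType") != 10:
            continue
        src = event.get("IpAddress", "unknown")
        if event.get("EventID") == 4625:
            failures[src] += 1
        elif event.get("EventID") == 4624:
            if failures[src] >= FAILURE_THRESHOLD:
                yield src, failures[src]
            failures[src] = 0

events = (
    [{"EventID": 4625, "LogonType": 10, "IpAddress": "10.0.0.9"}] * 6
    + [{"EventID": 4624, "LogonType": 10, "IpAddress": "10.0.0.9"}]
)
for src, count in rdp_bruteforce_alerts(events):
    print(f"ALERT: {count} failed RDP logons then success from {src}")
```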
Monitor, Improve, and Tune Custom Rules
Monitor the implemented custom rule(s) to verify they produce the desired results. If a rule doesn't hit on any correlated events, then test it by simulating the activity to make it fire. The custom rule(s) need to provide the exact desired results; if they don't, then identify how to improve them. After the rule(s) are updated, monitor again to verify they produce the desired results. Furthermore, the rule(s) need to be tuned to the environment to reduce false positives. At times rule(s) may fire on normal behavior, so adjusting them not to fire on that activity in the future minimizes the noise.
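One common tuning approach is to suppress sources that triage has confirmed as normal behavior. A minimal sketch, assuming alerts carry a source address field, is below; the whitelist entry is hypothetical.

```python
# Minimal sketch of one tuning approach: suppress alerts from sources that
# triage has confirmed as normal behavior (e.g., an approved jump host).
KNOWN_GOOD_SOURCES = {"10.0.0.50"}  # illustrative: an approved admin jump host

def tune_alerts(alerts):
    """Drop alerts whose source was confirmed as normal during triage."""
    for alert in alerts:
        if alert.get("source_ip") in KNOWN_GOOD_SOURCES:
            continue  # tuned out: confirmed normal behavior
        yield alert

alerts = [
    {"rule": "rdp_bruteforce", "source_ip": "10.0.0.50"},
    {"rule": "rdp_bruteforce", "source_ip": "203.0.113.7"},
]
print(list(tune_alerts(alerts)))
```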
Establish and Document Triage Process
Building out an organization's security detection capability results in activity being detected; thus a response is needed to what is detected. Based on the custom rule(s), establish a triage process that outlines how an alert is evaluated to determine whether it's valid and how critical it is. First, evaluate any existing triage processes to see if any apply to these new rules. If there isn't an applicable triage process, then create one. The goal is to minimize the number of different triage processes while ensuring there are sufficient triage processes to handle the alerts generated by the rules.
In my opinion establishing triage processes is the second most critical step (detection rules are the first). Triage is what determines what is accepted as "good" behavior, what needs to be addressed, and what needs to be escalated. After the custom rule(s) are implemented, take some time reviewing the rule(s) that fired. Evaluate the activity that triggered the rule and try out different triage techniques. This is repeated until there is a repeatable triage process for the custom rule(s). Continue testing the repeatable triage process to make it more efficient and faster. Look at the false positives and determine if there is a way to identify them sooner in the process. Look at the techniques that require more in-depth analysis and move them to later in the process. The triage process walks a fine line between being as fast as possible and using resources as efficiently as possible. Remember, the more time spent on one alarm, the less time is available for others; but the less time spent on an alarm, the greater the chance malicious activity is missed.
The final triage process is documented so it is repeatable by the entire team.
Train the Team
The final activity is to train the rest of the security detection team on the custom rule(s), how they work, and the triage process to use when they alert on activity. The team members are the ones who manage the parts of the use case already in place, which allows the remaining activities to be completed.
Default Rules
Identify Default Rules for Use Case
At this point the default rules in the event management system are reviewed. The only default rules to be concerned about are the ones that trigger on activity relevant to the use case of interest. Identify these rules and review their properties to see how they work.
Explore Correlated Default Rules
The event management system may have had the default rules enabled without alerting on them. Depending on the event management system, the default rules may need to be enabled. Either way, ensure the triggered rules do not generate alerts yet. There is no need to distract the rest of the security detection team with alerts they will just ignore for the time being. Run queries in the event management system to identify any of the default rules that triggered on activity. Explore the triggered rules to see what the activity is and how it matches what the rule is looking for. There may be numerous rules which don't trigger on anything; these are addressed in the future as they occur.
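A minimal sketch of that kind of query is below, assuming the event management system can export its triggered-rule records with a rule name field; the field and rule names are illustrative.

```python
# Minimal sketch of "which default rules actually triggered?": count the
# hits per rule from exported triggered-rule records.
from collections import Counter

def triggered_rule_counts(rule_hits):
    """Count how many times each default rule fired on collected activity."""
    return Counter(hit.get("rule_name", "unknown") for hit in rule_hits)

rule_hits = [
    {"rule_name": "Default: Excessive Authentication Failures"},
    {"rule_name": "Default: Excessive Authentication Failures"},
    {"rule_name": "Default: Suspicious Proxy Category"},
]
for rule, count in triggered_rule_counts(rule_hits).most_common():
    print(rule, count)
```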
Tune Default Rules
Explore the triggered rules to see what the activity is, how it matches what the rule is looking for, and how many hits are false positives. Identifying false positives may require triaging a few. Default rules can be very noisy and need to be tuned to the environment. Look at the noisy rules and figure out what can be adjusted to reduce false positives. Make the adjustments and monitor the rules to see if the false positives are reduced. If not, continue making adjustments and monitoring to eliminate the false positives. Some default rules are just too noisy and no amount of tuning will change that; these rules are disabled.
Keep in mind, when tuning rules, ensure all the activity from other logs around the time of interest is taken into account. At times one data source may indicate something happened while another shows the activity was blocked.
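A minimal sketch of tracking how noisy each default rule is, assuming triage outcomes are recorded per alert, is below; the noise threshold is an illustrative choice, not a standard.

```python
# Minimal sketch of measuring rule noise from triage outcomes recorded as
# "false_positive" or "valid", to decide which default rules to tune or disable.
from collections import defaultdict

NOISE_THRESHOLD = 0.90  # flag rules whose hits are >90% false positives

def noisy_rules(triaged_alerts):
    totals = defaultdict(int)
    false_positives = defaultdict(int)
    for alert in triaged_alerts:
        rule = alert["rule_name"]
        totals[rule] += 1
        if alert["outcome"] == "false_positive":
            false_positives[rule] += 1
    for rule, total in totals.items():
        rate = false_positives[rule] / total
        if rate >= NOISE_THRESHOLD:
            yield rule, rate, total

triaged = [
    {"rule_name": "Default: Excessive Authentication Failures", "outcome": "false_positive"},
    {"rule_name": "Default: Excessive Authentication Failures", "outcome": "false_positive"},
    {"rule_name": "Default: Suspicious Proxy Category", "outcome": "valid"},
]
for rule, rate, total in noisy_rules(triaged):
    print(f"{rule}: {rate:.0%} false positives over {total} hits - tune or disable")
```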
Establish and Document Triage Process
Establishing and documenting the triage process works the same as it did in the custom rules section. Remember, the more time spent on one alarm, the less time is available for others; but the less time spent on an alarm, the greater the chance malicious activity is missed. First, evaluate any existing triage processes to see if any apply to these default rules. If there isn't an applicable triage process, then create one. The goal is to minimize the number of different triage processes while ensuring there are sufficient triage processes to handle every alert. The final triage process is documented so it is repeatable by the entire team.
Train the Team
The final activity is to train the rest of the security detection team on the default rules, how they work, and the triage process to use when they alert on activity. The team members are the ones who manage the parts of the use case already in place, which allows the remaining activities to be completed.
Detect Better Respond Faster
Measure Detection in Depth
Use cases range from having a single rule to numerous rules. Monitor and evaluate the quality of these rules and the coverage they provide for the use case. There are very few models or methods for accomplishing this task. Pick a model/method to use or develop one that meets the organization's needs.
The few thought processes I've seen on measuring detection in depth are those by David Bianco. His Pyramid of Pain model is a way to determine the quality of the rules: the higher in the pyramid a rule's indicator sits, the better its quality. Another item to help with determining the quality of rules is a chart provided by Anton Chuvakin in his post SIEM and Badness Detection. Finally, in time the rules that are more accurate at detecting activity will start to stand out from the rest. These are the high quality rules for the use case in question.
The second part of measuring detection in depth is tracking the rules' coverage of the use case. David's bed of nails concept ties together the Pyramid of Pain with the kill chain model for detection coverage. David tweeted links to a talk where he discusses this, and I'm including them in this post. The video for the Pyramid of Pain: Intel-Driven Detection/Response to Increase Adversary's Cost is located here while the slides are located here.
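A minimal sketch of one way to track this coverage is below; it assumes each rule is tagged with the kill chain phase it covers and the Pyramid of Pain level of its indicator, and the rules and tags shown are illustrative.

```python
# Minimal sketch of tracking detection in depth: map rules onto kill chain
# phases and report the best Pyramid of Pain level covering each phase.
KILL_CHAIN = ["recon", "weaponization", "delivery", "exploitation",
              "installation", "command_and_control", "actions_on_objectives"]

# Pyramid of Pain levels, from easiest to hardest for the adversary to change.
PAIN_LEVELS = ["hash", "ip", "domain", "artifact", "tool", "ttp"]

rules = [
    {"name": "rdp_bruteforce",      "phase": "exploitation",        "pain": "ttp"},
    {"name": "known_bad_ip_beacon", "phase": "command_and_control", "pain": "ip"},
]

def coverage(rules):
    """Print, per kill chain phase, how many rules cover it and their best indicator."""
    by_phase = {phase: [] for phase in KILL_CHAIN}
    for rule in rules:
        by_phase[rule["phase"]].append(rule)
    for phase in KILL_CHAIN:
        covering = by_phase[phase]
        best = max((PAIN_LEVELS.index(r["pain"]) for r in covering), default=-1)
        best_label = PAIN_LEVELS[best] if best >= 0 else "NO COVERAGE"
        print(f"{phase:25} {len(covering)} rule(s)  best indicator: {best_label}")

coverage(rules)
```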
Continuously Tune Rules
Over time the organization's network, servers, clients, and applications change. These changes will impact the event management system and may produce false positives. Tuning the rules to the environment is an ongoing process, so continue to make adjustments to rules as needed.
Continuously Improve & Add Rules Based on Response
Rules constantly evolve, with existing ones getting updates and new ones being implemented, all in an effort to continuously improve an organization's security detection capability. There are two sources of information to use for improvement, and one of them is what is learned from triaging and responding to alerts. After each validated alert and security incident, the question to ask is: what can be improved to make detection better? Was activity missed, can rules be more focused on the activity, is a new rule required, etc.? Each alert is an opportunity for improvement, and each day strive to be better than the previous one. In my opinion, the best source of intelligence to improve one's detection capabilities is the information gained through response.
Continuously Improve & Add Rules Based on Intel
The other source of information to use for improvement is intelligence produced by others. This includes a range of items, from papers on the latest techniques used by threats, to blog posts about what someone is seeing, to information shared by others. Some of the information won't apply, but the items that do need to be implemented in the event management system. Again, the goal is to strive to be better than the previous day.
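A minimal sketch of folding shared indicators into a watchlist that a rule can check against is below; the feed format (an indicator,type,description CSV) and file name are assumptions for illustration.

```python
# Minimal sketch of loading shared intelligence indicators into a watchlist
# and matching collected events against it. Feed format is assumed.
import csv

def load_watchlist(path, indicator_type="ip"):
    """Load indicators of one type from a CSV feed: indicator,type,description."""
    watchlist = set()
    with open(path, newline="", encoding="utf-8") as feed:
        for row in csv.reader(feed):
            if len(row) >= 2 and row[1].strip().lower() == indicator_type:
                watchlist.add(row[0].strip())
    return watchlist

def matches_watchlist(events, watchlist, field="dest_ip"):
    """Yield events whose destination matches an indicator from the feed."""
    for event in events:
        if event.get(field) in watchlist:
            yield event

# Example usage (assumes an indicators.csv feed exists alongside the script):
# watchlist = load_watchlist("indicators.csv")
# for hit in matches_watchlist(collected_events, watchlist):
#     print("Intel match:", hit)
```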
Continuously Improve Triage
Striving to be better each day is not limited to detection only. The mantra needs to be: Detect Better Respond Faster. After each validated alert and security incident, the question to ask is: what can be improved to make response faster? Can the triage process be more efficient, are the triage tools adequate, what can make the process faster, etc.? Each time a triage process is completed it's a learning opportunity for improvement. Remember, the more time spent on one alarm, the less time is available for others; but the less time spent on an alarm, the greater the chance malicious activity is missed. Walk the fine line between speed and efficiency.
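Improvements are easier to judge when triage speed is actually measured. Below is a minimal sketch, assuming the time spent on each alert is recorded during triage; the record format and numbers are illustrative.

```python
# Minimal sketch of measuring triage speed: average handling time per rule
# from triage records, so process changes can be compared over time.
from collections import defaultdict

def mean_triage_minutes(triage_records):
    total = defaultdict(float)
    count = defaultdict(int)
    for record in triage_records:
        total[record["rule_name"]] += record["minutes_spent"]
        count[record["rule_name"]] += 1
    return {rule: total[rule] / count[rule] for rule in total}

records = [
    {"rule_name": "rdp_bruteforce", "minutes_spent": 18},
    {"rule_name": "rdp_bruteforce", "minutes_spent": 9},
]
for rule, avg in mean_triage_minutes(records).items():
    print(f"{rule}: {avg:.1f} minutes per alert on average")
```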
Ensure Logging Configuration
Over time the configurations of the organization's network, servers, clients, and applications change. Some implemented rules in the use case depend on certain events being present. A simple configuration change can render a rule ineffective, thus impacting the organization's security detection capability. It's imperative to periodically review the correlated events in the event management system to see if anything has drastically changed. This is especially true for any custom rules implemented.
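A minimal sketch of one such periodic check is below: compare each log source's recent event volume against its baseline and flag drastic drops, which often point to a logging configuration change. The threshold and source names are illustrative.

```python
# Minimal sketch of a periodic sanity check for log source volume.
DROP_THRESHOLD = 0.25  # flag sources that fall below 25% of their baseline

def quiet_sources(baseline_counts, recent_counts):
    """Yield (source, baseline, recent) for sources whose volume dropped drastically."""
    for source, baseline in baseline_counts.items():
        recent = recent_counts.get(source, 0)
        if baseline > 0 and recent < baseline * DROP_THRESHOLD:
            yield source, baseline, recent

baseline = {"dc01-security-log": 12000, "web-proxy": 45000}
recent   = {"dc01-security-log": 300,   "web-proxy": 44000}

for source, base, now in quiet_sources(baseline, recent):
    print(f"WARNING: {source} dropped from ~{base} to {now} events - check its logging configuration")
```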
SIEM Use Case Implementation Mind Map Wrap-up
Use cases are an effective approach to building out an organization's security detection capability. I walked through how one can take a documented use case and translate it into something actionable to improve an organization's security detection capability. The activities are not all-inclusive, but they are a decent set of minimum activities to accomplish.