Fake It to Make It: Tips and Tricks for Generating Sample Splunk Data Sets
by Scott DeMoss, on Aug 12, 2016 11:53:53 AM
As you continue to work with Splunk and the number of underlying use cases within your organization grows, you will ultimately encounter a situation where you need to generate some “fake” data. Perhaps you need to create a visualization to use for a proof of concept; perhaps you are trying to master a specific search or visualization; or perhaps you quickly need a few pieces of data for demonstrating a feature to a colleague.
As a Splunk Solution Architect and Consulting Engineer at GTRI, I often make use of synthesized data for all of these reasons and many more. While there are many methods for obtaining sample data for your Splunk needs, in this article I will focus on two methods for creating sample Splunk data sets that do not require any indexing.
Generating Time-series Data for Sample Visualizations
If you’ve worked with Splunk for very long, you quickly realize that users can be VERY particular about the format and appearance of visualizations. The search in this example let me quickly generate a few days of hourly data points that I could use to iteratively tweak the colors and chart format for a customer to review.
This search uses a combination of the gentimes, eval, and chart commands to produce a visual output that can be added to a dashboard prototype.
| gentimes start=07/23/2016 increment=1h | eval myValue=random()%500 | eval myOtherValue=random()%300 | eval starttime=strftime(starttime, "%m-%d-%Y %H:%M:%S") | chart max(myValue) AS myValue max(myOtherValue) AS myOtherValue over starttime
Let’s break down this search:
The gentimes command on its own creates a series of timestamps beginning with the date specified in the start argument. In this example, I’ve added the increment argument to further specify the interval between timestamps (“1h”, or hourly, in this case). The net effect is to create hourly timestamps from the start date up until the current date/time.
The search pipes the output of the gentimes command (the hourly timestamps) into a pair of eval commands that simply create two fictitious fields and values to associate with each timestamp. In both eval commands, I used the random() function with the modulo operator (%<integer>) to return a pseudo-random number between 0 and <integer> - 1.
The chart command simply outputs my fictitious data into a tabular format that can be used to render visualizations via Splunk’s easy-to-use visualization tools.
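The same pipeline can be sketched outside Splunk. The snippet below is a minimal Python analogue (not Splunk’s implementation): it generates hourly timestamps from a fixed start date, attaches two pseudo-random fields via the modulo trick, and formats each timestamp the way the strftime call in the search does. The helper name gen_times is illustrative only.

```python
# Minimal Python analogue of the gentimes + eval pipeline (illustrative only).
import random
from datetime import datetime, timedelta

def gen_times(start, increment=timedelta(hours=1), end=None):
    """Yield timestamps from `start` up to `end` (default: now) at `increment`."""
    end = end or datetime.now()
    t = start
    while t <= end:
        yield t
        t += increment

rows = [
    {
        # Same format string as the strftime() call in the search.
        "starttime": t.strftime("%m-%d-%Y %H:%M:%S"),
        "myValue": random.randrange(2**31) % 500,       # like random()%500 -> 0..499
        "myOtherValue": random.randrange(2**31) % 300,  # like random()%300 -> 0..299
    }
    for t in gen_times(datetime(2016, 7, 23))
]

for row in rows[:3]:
    print(row)
```

Swapping the increment (say, timedelta(minutes=5)) changes the density of data points, just as the increment argument to gentimes does.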
Executing the search above lets you quickly generate charts like the one in the screenshot below that can be used for tasks such as modifying simple XML to specify color settings.
Various forms of this search can be used to create visualizations that mimic a data source a customer uses (or plans to use) but cannot provide. The search can easily be modified to create any number of fields by adding additional eval statements, and a large number of discrete events can be generated quickly by adjusting the start and increment arguments to the gentimes command. If you have a longer-term need for the data, you could even write it to an index or summary index.
Creating Tabular Data
In some instances, generating a small set of tabular data may prove useful. Often times I work with customers who want to render Splunk search results in a table with no drilldown. With this quick and simple search, I can generate a small number of results in a tabular format. The search is particularly useful because it creates results with a wide variety of data types: timestamps, counts, string data, numerical data, and both single and multi-value fields.
| noop | makeresults | eval field1 = "abc def ghi jkl mno pqr stu vwx yz" | makemv field1 | mvexpand field1
| eval multiValueField = "cat dog bird" | makemv multiValueField
| streamstats count | eval field2 = random()%100
| eval _time = now() + random()%100 | table _time count field1 field2 multiValueField
At first pass, there appears to be a lot going on here. In reality, it isn’t too complicated.
The noop command is documented as a Splunk debugging command; in practice I have only ever used it for generating sample data in scenarios such as this one. In distributed environments, it prevents the search from being sent out to the indexers. It is used here for speed, as it essentially tells Splunk to complete no operations (i.e., “noop”) before producing the result.
The makeresults command is required here because the subsequent eval command expects (and requires) a result set on which to operate; without one, it will raise an error. It creates the specified number of results (in this case the default of one) and passes them to the next pipe in the search.
The eval field1 command creates a text field with the value "abc def ghi jkl mno pqr stu vwx yz".
The makemv command converts field1 from a single-value field to a multivalue field, breaking up the value on the default whitespace delimiter.
The magic happens with the mvexpand command. It takes the values of a multivalue field (created by the preceding makemv command) and creates an individual event for each value; here, that results in nine separate events.
The eval multiValueField = "cat dog bird" | makemv multiValueField commands simply create an additional field and populate it with multiple values.
The streamstats count command calculates a running statistic (in this example, a count of events) once for each event returned by the search. Since the mvexpand command expanded the text string into nine events, streamstats adds a count field to each event representing the number of events returned so far (1 through 9).
The eval field2 command creates a fictitious numerical field whose value will be a number between 0 and 99. This is the same modulo technique used in the previous search.
The eval _time = now() + random()%100 command creates pseudo-random timestamps (within 100 seconds of the current time) for each of the nine events.
The table command simply specifies the fields, and their order, for display.
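The whole pipeline above can be mimicked in a few lines of plain Python. This is a rough, illustrative analogue (not Splunk’s internals): splitting a string into one row per token stands in for makemv + mvexpand, the loop counter stands in for streamstats count, and each row carries a shared multivalue field, a random numeric field, and a near-now timestamp.

```python
# Rough Python analogue of the tabular-data search (illustrative only).
import random
import time

tokens = "abc def ghi jkl mno pqr stu vwx yz".split()  # makemv + mvexpand
multi_value = "cat dog bird".split()                   # the multivalue field

rows = []
for count, field1 in enumerate(tokens, start=1):       # streamstats count
    rows.append({
        "_time": time.time() + random.randrange(100),  # now() + random()%100
        "count": count,
        "field1": field1,
        "field2": random.randrange(100),               # random()%100 -> 0..99
        "multiValueField": multi_value,
    })

print(len(rows))  # prints 9, one row per token
```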
The net result is the table below. You could also use the chart command to render it as a pie chart or other visualization:
But What If I Need to…
The two techniques discussed in this article are versatile, quick methods for generating usable samples of data for various purposes. Unfortunately, they won’t cover every conceivable scenario; there are certainly times when sample data sets of a specific source and format are the only way to fulfill a request. If you have questions about Splunk data sets, feel free to connect with me on LinkedIn.
Scott DeMoss is a Solution Architect for Data Center and Big Data in Professional Services at GTRI.