Parsing rulesets

Parsing a ruleset from a file takes only a few lines:

import yaramod

y = yaramod.Yaramod(yaramod.Features.AllCurrent)
yara_file = y.parse_file('/opt/ruleset.yar')
print(yara_file.text)

Alternatively, you can parse a ruleset from a string held in memory:

import yaramod

y = yaramod.Yaramod(yaramod.Features.AllCurrent)
yara_file = y.parse_string(r'''
rule abc {
    condition:
        true
}
''')
print(yara_file.text)

Rules

You can iterate over all rules in the file.

for rule in yara_file.rules:
    print(rule.name)
    print(f'  Global: {rule.is_global}')
    print(f'  Private: {rule.is_private}')

Metas

You can also access the meta information of each rule.

for rule in yara_file.rules:
    for meta in rule.metas:
        if meta.value.is_string:
            print('String meta: ', end='')
        elif meta.value.is_int:
            print('Int meta: ', end='')
        elif meta.value.is_bool:
            print('Bool meta: ', end='')
        print(f'{meta.key} = {meta.value.pure_text}')
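If you only need the metas as key/value pairs, the loop above can be folded into a small helper. This is a minimal sketch: metas_as_dict is a hypothetical name, not part of yaramod, and it relies only on the attributes already used above (rule.metas, meta.key and meta.value.pure_text).

```python
def metas_as_dict(rule):
    """Map each meta key of a rule to its textual value.

    Relies only on rule.metas, meta.key and meta.value.pure_text.
    If a key appears more than once, the last occurrence wins.
    """
    return {meta.key: meta.value.pure_text for meta in rule.metas}
```

It can then be called as metas_as_dict(rule) for each rule in yara_file.rules.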

Variables

You can iterate over the local variables available in each rule and see their identifier, type and value.

for rule in yara_file.rules:
    for variable in rule.variables:
        if variable.value is str:
            print('Plain string: ', end='')
        elif variable.value is int:
            print('Integer: ', end='')
        elif variable.value is float:
            print('Double: ', end='')
        elif variable.value is bool:
            print('Boolean: ', end='')
        print(f'{variable.key} = {variable.value.text}')

Strings

Iterating over available strings is also possible and you can distinguish which kind of string you are dealing with.

for rule in yara_file.rules:
    for string in rule.strings:
        if string.is_plain:
            print('Plain string: ', end='')
        elif string.is_hex:
            print('Hex string: ', end='')
        elif string.is_regexp:
            print('Regexp: ', end='')
        print(f'{string.identifier} = {string.text}')
        print(f'  ascii: {string.is_ascii}')
        print(f'  wide: {string.is_wide}')
        print(f'  nocase: {string.is_nocase}')
        print(f'  fullword: {string.is_fullword}')
        print(f'  private: {string.is_private}')
        print(f'  xor: {string.is_xor}')
        print(f'  base64: {string.is_base64}')
        print(f'  base64wide: {string.is_base64_wide}')
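The same properties can be used to summarise a whole ruleset. The sketch below counts strings by kind; count_string_kinds is a hypothetical helper name, and it uses only the is_plain, is_hex and is_regexp properties shown above.

```python
from collections import Counter


def count_string_kinds(rules):
    """Count plain, hex and regexp strings across an iterable of rules."""
    counts = Counter()
    for rule in rules:
        for string in rule.strings:
            if string.is_plain:
                counts['plain'] += 1
            elif string.is_hex:
                counts['hex'] += 1
            elif string.is_regexp:
                counts['regexp'] += 1
    return counts
```

Calling count_string_kinds(yara_file.rules) would then give one counter for the whole file.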

Condition

There are two ways to look at the condition. In the first, you only care about the textual representation of the condition, not about its contents. That one is straightforward.

for rule in yara_file.rules:
    print(rule.condition.text)

The second way applies when you care about the contents of the condition and want to analyse it. This is trickier because you do not know the structure of the condition in advance, so you would otherwise have to write recursive algorithms or other traversals over the abstract syntax tree (AST) of the condition. To make this easier, yaramod takes an approach similar to LLVM and lets you use the visitor design pattern to perform the traversal.

Note

If you are not familiar with this design pattern, imagine that there are several types of expressions and statements that can appear in the condition (integers, logical operations, arithmetic operations, …). You want to perform your operation on all of them, taking their type into account. With the visitor design pattern, you define your operation once for each type of expression or statement and that’s it. You then visit each node of the abstract syntax tree, and your operation is performed there.
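To make the idea concrete, here is a toy sketch of the pattern in plain Python. It does not use yaramod at all: IntLiteral, Plus, Evaluator and Printer are made-up names for illustration only. Each node type forwards to a matching visit_* method, so every operation is written once per node type.

```python
class IntLiteral:
    def __init__(self, value):
        self.value = value

    def accept(self, visitor):
        # Dispatch to the method that handles this node type.
        return visitor.visit_IntLiteral(self)


class Plus:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def accept(self, visitor):
        return visitor.visit_Plus(self)


class Evaluator:
    """One operation (evaluation), defined once per node type."""

    def visit_IntLiteral(self, node):
        return node.value

    def visit_Plus(self, node):
        return node.left.accept(self) + node.right.accept(self)


class Printer:
    """A second, independent operation over the same node types."""

    def visit_IntLiteral(self, node):
        return str(node.value)

    def visit_Plus(self, node):
        return f'({node.left.accept(self)} + {node.right.accept(self)})'


tree = Plus(IntLiteral(1), Plus(IntLiteral(2), IntLiteral(3)))
print(tree.accept(Evaluator()))  # 6
print(tree.accept(Printer()))    # (1 + (2 + 3))
```

Adding a new operation means writing a new visitor class; the node classes stay untouched.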

Condition visitors

Let’s say we want to print each function that is called in the rule condition.

class FunctionCallDumper(yaramod.ObservingVisitor):
    def visit_FunctionCallExpression(self, expr):
        print('Function call: {}'.format(expr.function.text))
        # Visit arguments because they can contain nested function calls
        for arg in expr.arguments:
            arg.accept(self)

Note

As you can see, visitors depend heavily on recursion, which can cause problems with huge rulesets where the AST is very deep. Python limits how many stack frames can exist at the same time in order to prevent stack overflow. This limit can be too low for certain huge conditions, so you might need to call sys.setrecursionlimit to process them.
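One way to keep the change contained is to raise the limit only around the traversal and restore it afterwards. The following is a sketch using the standard library only; recursion_limit and nested_depth are hypothetical names, and a chain of nested lists stands in for a deep AST.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, then restore the old value."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below what is already configured.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def nested_depth(node):
    """Recursively count how deeply one-element lists are nested."""
    return 1 + nested_depth(node[0]) if node else 0


# Build a 3000-level chain iteratively as a stand-in for a deep condition.
deep = []
for _ in range(3000):
    deep = [deep]

with recursion_limit(10000):
    depth = nested_depth(deep)
print(depth)  # 3000
```

With a real ruleset, the visitor traversal of the condition would go inside the with block. Very large limits can still exhaust the native stack, so pick a value close to what your rules actually need.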

Expression types

Each expression node has its own unique id (uid). These uids are unique only within the scope of a single rule; they let you identify a specific node in the AST for extra processing. There are many expression types that you can visit. Here is a list of them all:

String expressions

  • StringExpression - reference to string in strings section ($a01, $sa02, $str, …)

  • StringWildcardExpression - reference to multiple strings using wildcard ($a*, $*, …)

  • StringAtExpression - refers to $str at <offset>

  • StringInRangeExpression - refers to $str in (<offset1> .. <offset2>)

  • StringCountExpression - reference to the number of matches of a given string identifier (#a01, #str)

  • StringOffsetExpression - reference to first match offset (or Nth match offset) of string identifier (@a01, @a01[N])

  • StringLengthExpression - reference to length of first match (or Nth match) of string identifier (!a01, !a01[N])

Unary operations

All of these provide method getOperand() (operand in Python) to return operand of an expression.

  • NotExpression - refers to logical not operator (not @str > 10)

  • UnaryMinusExpression - refers to unary - operator (-20)

  • PercentualExpression - refers to unary % operator (20%)

  • BitwiseNotExpression - refers to bitwise not (~uint8(0x0))

Binary operations

All of these provide methods getLeftOperand() and getRightOperand() (left_operand and right_operand in Python) to return both operands of an expression.

  • AndExpression - refers to logical and ($str1 and $str2)

  • OrExpression - refers to logical or ($str1 or $str2)

  • LtExpression - refers to < operator ($str1 < $str2)

  • GtExpression - refers to > operator ($str1 > $str2)

  • LeExpression - refers to <= operator (@str1 <= $str2)

  • GeExpression - refers to >= operator (@str1 >= @str2)

  • EqExpression - refers to == operator (!str1 == !str2)

  • NeqExpression - refers to != operator (!str1 != !str2)

  • ContainsExpression - refers to contains operator (pe.sections[0].name contains "text")

  • MatchesExpression - refers to matches operator (pe.sections[0].name matches /(text|data)/)

  • IequalsExpression - refers to iequals operator (pe.sections[0].name iequals "text")

  • IcontainsExpression - refers to icontains operator (pe.sections[0].name icontains "text")

  • EndsWithExpression - refers to endswith operator (pe.sections[0].name endswith "text")

  • IendsWithExpression - refers to iendswith operator (pe.sections[0].name iendswith "text")

  • StartsWithExpression - refers to startswith operator (pe.sections[0].name startswith "text")

  • IstartsWithExpression - refers to istartswith operator (pe.sections[0].name istartswith "text")

  • PlusExpression - refers to + operator (@str1 + 0x100)

  • MinusExpression - refers to - operator (@str1 - 0x100)

  • MultiplyExpression - refers to * operator (@str1 * 0x100)

  • DivideExpression - refers to \ operator (@str1 \ 0x100)

  • ModuloExpression - refers to % operator (@str1 % 0x100)

  • BitwiseXorExpression - refers to ^ operator (uint8(0x10) ^ uint8(0x20))

  • BitwiseAndExpression - refers to & operator (pe.characteristics & pe.DLL)

  • BitwiseOrExpression - refers to | operator (pe.characteristics | pe.DLL)

  • ShiftLeftExpression - refers to << operator (uint8(0x10) << 2)

  • ShiftRightExpression - refers to >> operator (uint8(0x10) >> 2)
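As a small illustration of the Python properties named above, the sketch below measures how deep a chain of unary and binary operators goes. operator_depth is a hypothetical helper; it assumes that expression types without an operand, left_operand or right_operand attribute can be treated as leaves, so it does not descend into function arguments, for bodies or parentheses. It uses an explicit stack instead of recursion.

```python
def operator_depth(expr):
    """Depth of the tree formed by unary and binary operators only.

    Follows the operand, left_operand and right_operand attributes;
    any node without them is treated as a leaf (an assumption of
    this sketch, not a yaramod guarantee).
    """
    deepest = 0
    stack = [(expr, 1)]
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        for name in ('operand', 'left_operand', 'right_operand'):
            child = getattr(node, name, None)
            if child is not None:
                stack.append((child, depth + 1))
    return deepest
```

Because it does not recurse, this helper is not affected by the interpreter's recursion limit.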

For expressions

All of these provide method getVariable() (variable in Python) to return variable used for iterating over the set of values (can also be any, all or none), getIterable() (iterable in Python) to return an iterated set (can also be them) and getBody() (body in Python) to return the body of a for expression. For OfExpression, the getBody() method always returns nullptr (None in Python).

  • ForDictExpression - refers to for which operates on dictionary (for all k, v in some_dict : ( ... ))

  • ForArrayExpression - refers to for which operates on array or set of integers (for all section in pe.sections : ( ... ))

  • ForStringExpression - refers to for which operates on set of string identifiers (for all of ($str1, $str2) : ( ... ))

  • OfExpression - refers to of (all of ($str1, $str2) or all of ($str1, $str2) in (filesize-500..filesize) or any of ($str1, $str2) at 0)

Identifier expressions

All of these provide method getSymbol() (symbol in Python) to return symbol of an associated identifier.

  • IdExpression - refers to identifier (rule1, pe)

  • StructAccessExpression - refers to . operator for accessing structure members (pe.number_of_sections)

  • ArrayAccessExpression - refers to [] operator for accessing items in arrays (pe.sections[0])

  • FunctionCallExpression - refers to function call (pe.exports("ExitProcess"))

Literal expressions

  • BoolLiteralExpression - refers to true or false

  • StringLiteralExpression - refers to any sequence of characters enclosed in double-quotes ("text")

  • IntLiteralExpression - refers to any integer value be it decimal, hexadecimal or with multipliers (KB, MB) (42, -42, 0x100, 100MB)

  • DoubleLiteralExpression - refers to any floating point value (72.0, -72.0)

Keyword expressions

  • FilesizeExpression - refers to keyword filesize

  • EntrypointExpression - refers to keyword entrypoint

  • AllExpression - refers to keyword all

  • AnyExpression - refers to keyword any

  • NoneExpression - refers to keyword none

  • ThemExpression - refers to keyword them

Other expressions

  • SetExpression - refers to set of either integers or string identifiers ((1,2,3,4,5), ($str*,$1,$2))

  • RangeExpression - refers to range of integers ((0x100 .. 0x200))

  • ParenthesesExpression - refers to expression enclosed in parentheses (((5 + 6) * 30))

  • IntFunctionExpression - refers to special built-in functions (u)int(8|16|32) (uint16(<offset>))

  • RegexpExpression - refers to regular expression (/<regexp>/<mods>)

Includes

The YARA language supports including other files from the filesystem. The path given in an include directive is always relative to the YARA file being parsed. Since yaramod can also parse from memory, relative paths are only allowed when parsing from an actual file.
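The resolution rule can be modelled in a few lines. This is only an illustration of "relative to the file being parsed", not yaramod's actual implementation; resolve_include and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_include(including_file, include_path):
    """Resolve an include path against the directory of the including file."""
    # The include is looked up next to the file that contains the directive,
    # not relative to the current working directory.
    return PurePosixPath(including_file).parent / include_path


print(resolve_include('/opt/rules/main.yar', 'common/base.yar'))
# /opt/rules/common/base.yar
```

When parsing from a string there is no including file, so there is no directory to resolve a relative path against.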

Whenever yaramod runs into include, it takes the content of included file and starts parsing it as if it was in place of an include. Therefore, included content is merged with all other content in the file. You can distinguish where the rule comes from using a location attribute of the rule.

for rule in yara_file.rules:
    print(f'{rule.name}: {rule.location.file_path}:{rule.location.line_number}')

Yaramod can also handle something that YARA doesn’t handle well: including the same file multiple times. If you do this in YARA, you get an error saying that your ruleset contains duplicate rules. That is not something you want to run into when doing static analysis. You can allow duplicate includes like this:

ymod = yaramod.Yaramod()
ymod.parse_file('/path/to/file', yaramod.ParserMode.IncludeGuarded)

Imports

You can check which modules are imported. Keep in mind that imports are merged from all included files.

for module in yara_file.imports:
    print(f'{module.name}')

Tokens

Yaramod provides an interface to access information about the underlying tokens related to a particular object. This information can be used to determine object location within the parsed file.

for rule in yara_file.rules:
    for string in rule.strings:
        start = string.token_first.location.begin
        end = string.token_last.location.end
        print(f'[{start.line}, {start.column}] - [{end.line}, {end.column}]')

A Token exposes a Location, which consists of two Positions: begin and end. A Position represents the position of a character within the parsed file, given by line and column. The currently supported token getters are:

Supported token getters

Object    Accessor
Rule      token_first, token_last
Meta      token_key, token_value
String    token_first, token_last, token_id, token_assign
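Once you have a begin and an end position, you can cut the matching text out of the original source. The sketch below is plain Python and makes one loud assumption: lines and columns are 1-based and the end position is inclusive. Check this against your yaramod version before relying on it. slice_source is a hypothetical helper name.

```python
def slice_source(source, begin, end):
    """Return the text between two (line, column) positions.

    ASSUMPTION: lines and columns are 1-based and the end position is
    inclusive. Adjust the arithmetic if your yaramod version differs.
    """
    lines = source.splitlines()
    first, last = begin[0] - 1, end[0] - 1
    if first == last:
        return lines[first][begin[1] - 1:end[1]]
    middle = lines[first + 1:last]
    return '\n'.join([lines[first][begin[1] - 1:], *middle, lines[last][:end[1]]])
```

With the loop above, it could be called as slice_source(text, (start.line, start.column), (end.line, end.column)), where text is the content of the parsed file.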