|
| 1 | +# Using the shared dataflow library |
| 2 | + |
| 3 | +## File organisation |
| 4 | + |
| 5 | +The files currently live in `semmle/code/python` (whereas the existing implementation lives in `semmle/python/dataflow`). |
| 6 | + |
| 7 | +This directory contains `DataFlow.qll`, `DataFlow2.qll`, etc., which refer to `internal/DataFlowImpl`, `internal/DataFlowImpl2`, etc., respectively. The `DataFlowImplN` files are all identical copies, which avoids mutual recursion. Each starts by including two files, `internal/DataFlowImplCommon` and `internal/DataFlowImplSpecific`. The former contains all the language-agnostic definitions, while the latter is where we describe our favorite language. `DataFlowImplSpecific` simply forwards to two other files, `internal/DataFlowPrivate.qll` and `internal/DataFlowPublic.qll`. Definitions in the former will be hidden behind a `private` modifier, while those in the latter can be referred to in data flow queries. For instance, the definition of `DataFlow::Node` should likely be in `DataFlowPublic.qll`. |
| 8 | + |
| 9 | +## Define the dataflow graph |
| 10 | + |
| 11 | +In order to use the dataflow library, we need to define the dataflow graph, |
| 12 | +that is, define the nodes and the edges. |
| 13 | + |
| 14 | +### Define the nodes |
| 15 | + |
| 16 | +The nodes are defined in the type `DataFlow::Node` (found in `DataFlowPublic.qll`). |
| 17 | +This should likely be an IPA type, so we can extend it as needed. |
| 18 | + |
| 19 | +Typical cases needed to construct the call graph include |
| 20 | + - argument node |
| 21 | + - parameter node |
| 22 | + - return node |
| 23 | + |
| 24 | +Typical extensions include |
| 25 | + - postupdate nodes |
| 26 | + - implicit `this`-nodes |
| 27 | + |
| 28 | +### Define the edges |
| 29 | + |
| 30 | +The edges split into local flow (within a function) and global flow (the call graph, between functions/procedures). |
| 31 | + |
| 32 | +Extra flow, such as reading from and writing to global variables, can be captured in `jumpStep`. |
| 33 | +The local flow should be obtainable from an SSA computation. |
| 34 | + |
| 35 | +The global flow should be obtainable from a `PointsTo` analysis. It is specified via `viableCallable` and |
| 36 | +`getAnOutNode`. Consider making `ReturnKind` a singleton IPA type, as in Java. |
| 37 | + |
| 38 | +If complicated dispatch needs to be modelled, try using the `[reduced|pruned]viable*` predicates. |
| 39 | + |
| 40 | +## Field flow |
| 41 | + |
| 42 | +To track flow through fields we need to provide a model of fields, that is the `Content` class. |
| 43 | + |
| 44 | +Field access is specified via `read_step` and `store_step`. |
| 45 | + |
| 46 | +Work is being done to make field flow handle lists and dictionaries and the like. |
| 47 | + |
| 48 | +`PostUpdateNode`s become important when field flow is used, as they track modifications to fields resulting from function calls. |
| 49 | + |
| 50 | +## Type pruning |
| 51 | + |
| 52 | +If type information is available, flows can be discarded on the grounds of type mismatch. |
| 53 | + |
| 54 | +Tracked types are given by the class `DataFlowType` and the predicate `getTypeBound`, and compatibility is recorded in the predicate `compatibleTypes`. |
| 55 | + |
| 56 | +Further, possible casts are given by the class `CastNode`. |
| 57 | + |
| 58 | +--- |
| 59 | + |
| 60 | +# Plan |
| 61 | + |
| 62 | +## Stage I, data flow |
| 63 | + |
| 64 | +### Phase 0, setup |
| 65 | +Define a minimal IPA type for `DataFlow::Node`. |
| 66 | +Define all required predicates empty (via `none()`), |
| 67 | +except `compatibleTypes` which should be `any()`. |
| 68 | +Define `ReturnKind`, `DataFlowType`, and `Content` as singleton IPA types. |
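The Phase 0 setup could look roughly like the following sketch. It is only an illustration under stated assumptions: the branch names `TCfgNode`, `TNormalReturnKind`, `TAnyType`, and `TNoContent` are hypothetical, the `Node` class omits members that the shared library also expects, and in practice the definitions would be split between `DataFlowPublic.qll` and `DataFlowPrivate.qll`.

```ql
import python

// Minimal IPA type for data flow nodes: one node per control flow node.
// The branch name TCfgNode is an assumption made for this sketch.
newtype TNode = TCfgNode(ControlFlowNode node)

class Node extends TNode {
  string toString() { result = "Data flow node" }
}

// Singleton IPA types; all branch names below are hypothetical.
newtype TReturnKind = TNormalReturnKind()

class ReturnKind extends TReturnKind {
  string toString() { result = "return" }
}

newtype TDataFlowType = TAnyType()

class DataFlowType extends TDataFlowType {
  string toString() { result = "DataFlowType" }
}

newtype TContent = TNoContent()

class Content extends TContent {
  string toString() { result = "Content" }
}

// Required predicates start out empty (only two are shown here)...
predicate simpleLocalFlowStep(Node nodeFrom, Node nodeTo) { none() }

predicate jumpStep(Node nodeFrom, Node nodeTo) { none() }

// ...except compatibleTypes, which accepts every pair so that no flow is pruned.
predicate compatibleTypes(DataFlowType t1, DataFlowType t2) { any() }
```

Starting from singleton types keeps the library compiling while each later phase replaces one placeholder at a time.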
| 69 | + |
| 70 | + |
| 71 | +### Phase 1, local flow |
| 72 | +Implement `simpleLocalFlowStep` based on the existing SSA computation. |
| 73 | + |
| 74 | +### Phase 2, global flow |
| 75 | +Implement `viableCallable` and `getAnOutNode` based on the existing predicate `PointsTo`. |
| 76 | + |
| 77 | +### Phase 3, field flow |
| 78 | +Redefine `Content` and implement `read_step` and `store_step`. |
| 79 | + |
| 80 | +Review use of post-update nodes. |
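As one possible shape for the redefined `Content`, the sketch below models a field as an attribute name. This is an assumption about how Python fields might be represented, not something the plan prescribes; `TAttributeContent` and `getAttribute` are hypothetical names, and the `read_step` and `store_step` implementations are left out.

```ql
import python

// One Content value per attribute name that occurs in the analysed program.
// TAttributeContent is a hypothetical branch name.
newtype TContent = TAttributeContent(string attr) { attr = any(Attribute a).getName() }

class Content extends TContent {
  // Hypothetical helper: the attribute name this content stands for.
  string getAttribute() { this = TAttributeContent(result) }

  string toString() { result = "Attribute " + this.getAttribute() }
}
```

Restricting the branch to names that actually appear in the program keeps the IPA type finite.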
| 81 | + |
| 82 | +### Phase 4, type pruning |
| 83 | +Use type trackers to obtain relevant type information and redefine `DataFlowType` to contain appropriate cases. Record the type information in `getTypeBound`. |
| 84 | + |
| 85 | +Implement `compatibleTypes` (perhaps simply as the identity). |
| 86 | + |
| 87 | +If necessary, re-implement `getErasedRepr` and `ppReprType`. |
| 88 | + |
| 89 | +If necessary, redefine `CastNode`. |
| 90 | + |
| 91 | +### Phase 5, bonus |
| 92 | +Review possible use of `[reduced|pruned]viable*` predicates. |
| 93 | + |
| 94 | +Review need for more elaborate `ReturnKind`. |
| 95 | + |
| 96 | +Review need for non-empty `jumpStep`. |
| 97 | + |
| 98 | +Review need for non-empty `isUnreachableInCall`. |
| 99 | + |
| 100 | +## Stage II, taint tracking |
| 101 | + |
| 102 | +### Phase 0, setup |
| 103 | +Implement all predicates empty. |
| 104 | + |
| 105 | +### Phase 1, experiments |
| 106 | +Try recovering an existing taint tracking query by implementing sources, sinks, sanitizers, and barriers. |
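An experiment of this kind might start from a configuration skeleton such as the one below. It is a sketch under assumptions: the import paths are guesses based on the directory named earlier, `TaintTracking.qll` is hypothetical until Stage II exists, `ExperimentConfig` is an illustrative name, and every predicate body is a placeholder to be replaced with the sources, sinks, and sanitizers of the query being recovered.

```ql
import python
// Assumed import paths: DataFlow.qll is described as living in semmle/code/python;
// TaintTracking.qll is hypothetical and would be added next to it in Stage II.
import semmle.code.python.DataFlow
import semmle.code.python.TaintTracking

class ExperimentConfig extends TaintTracking::Configuration {
  ExperimentConfig() { this = "ExperimentConfig" }

  // Placeholder bodies: fill in from the existing query under study.
  override predicate isSource(DataFlow::Node source) { none() }

  override predicate isSink(DataFlow::Node sink) { none() }

  override predicate isSanitizer(DataFlow::Node node) { none() }
}

from ExperimentConfig config, DataFlow::Node source, DataFlow::Node sink
where config.hasFlow(source, sink)
select sink, "Tainted data flows here from $@.", source, "this source"
```

With all bodies set to `none()` the query returns no results, which gives a baseline before the real definitions are filled in.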