Use file content heuristics to decide file reader. #1962
Dimi1010 wants to merge 87 commits into seladb:dev
…sed on the magic number.
…ics detection method.
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## dev #1962 +/- ##
==========================================
- Coverage 84.22% 83.83% -0.40%
==========================================
Files 309 313 +4
Lines 55070 55976 +906
Branches 11310 11828 +518
==========================================
+ Hits 46384 46928 +544
- Misses 7556 8225 +669
+ Partials 1130 823 -307
Tests/Pcap++Test/Tests/FileTests.cpp
Outdated
PTF_ASSERT_NOT_NULL(dynamic_cast<pcpp::PcapNgFileReaderDevice*>(genericReader));
PTF_ASSERT_TRUE(genericReader->open());
// ------- IFileReaderDevice::createReader() Factory
// TODO: Move to a separate unit test.
We should add the following to get more coverage:
- Open a snoop file
- Open a file that is not any of the options
- Open pcap files with different magic numbers
- Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions
3d713ab adds the following tests:
- Pcap, PcapNG, Zst file with correct content + extension
- Pcap, PcapNG file with correct content + wrong extension
- Bogus content file with correct extension (pcap, pcapng, zst)
- Bogus content file with wrong extension (txt)
Haven't found a snoop file to add. Do we have any?
> Open pcap files with different magic numbers

Do you mean Pcap content that has just its magic number changed? Because IMO it is reasonable to consider that an invalid format and fail, as with regular bogus data.

> Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions

Pending on #1962 (comment).
Move it out if it needs to be reused somewhere.
Libpcap has supported reading this format since 0.9.1. The heuristic detection will identify such a magic number as pcap and leave the final support decision to the pcap backend infrastructure.
@Dimi1010 some CI tests fail...
…in `createReader` instead of having the Format detector assume that is what is intended.
…t detector from libpcap behaviour.
…AII initialization.
…n tryCreateDevice.
…le-selection # Conflicts: # Pcap++/src/PcapFileDevice.cpp # Tests/Pcap++Test/Tests/FileTests.cpp
@seladb can we merge this? It has been sitting for a while.
};

/// @brief Heuristic file format detector that scans the magic number of the file format header.
class CaptureFileFormatDetector
Since we're not parsing all formats (maybe except Zstd) in PcapPlusPlus, we can reuse the logic we already have. Maybe it can run the open() method (or an extracted portion of it) for each reader type until it finds the right type?
WDYM, we are not parsing all formats? Did you mean "now"?
Also, the necessary logic to detect the file format is already extracted in this class. Tbh, the open() call should probably delegate the format detection to this class if more comprehensive magic number format validation is needed.
IMO, how the file is processed after format detection is a separate concern. Device selection is handled in the `createReader` device factory, which allows looser coupling between the actual device classes and format detection (e.g. switching whether PcapNG creates a PcapDevice or a PcapNGDevice is as simple as swapping a `case` statement).
I think integrating the functionality into open() would be suboptimal for the following reasons:
- It potentially adds more responsibilities to the function than just "open the device".
- Looping through all the devices would involve iterating through a loop of more complicated operations: constructing the device and possibly repeated file open / close for each `open()` call, as it is designed to function independently.
- An `open()` call can fail for multiple other reasons that are not related to the file format specifically.
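To illustrate the loose coupling argued for above, here is a minimal sketch of a factory that maps an already detected format to a device with a plain `switch`. All names (`SketchFormat`, `SketchReader`, `makeReaderFor`) are hypothetical stand-ins for illustration and are not the PR's actual classes.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for the detected format and the reader devices.
enum class SketchFormat { Unknown, Pcap, PcapNG, Snoop };

struct SketchReader
{
    virtual ~SketchReader() = default;
    virtual std::string name() const = 0;
};

struct SketchPcapReader : SketchReader
{
    std::string name() const override { return "pcap"; }
};

struct SketchPcapNgReader : SketchReader
{
    std::string name() const override { return "pcapng"; }
};

struct SketchSnoopReader : SketchReader
{
    std::string name() const override { return "snoop"; }
};

// Maps an already detected format to a device. No file is opened here, so an
// open() failure for unrelated reasons cannot be mistaken for a format mismatch.
inline std::unique_ptr<SketchReader> makeReaderFor(SketchFormat format)
{
    switch (format)
    {
    case SketchFormat::Pcap:
        return std::unique_ptr<SketchReader>(new SketchPcapReader());
    case SketchFormat::PcapNG:
        // Routing PcapNG to a different device only requires changing this case.
        return std::unique_ptr<SketchReader>(new SketchPcapNgReader());
    case SketchFormat::Snoop:
        return std::unique_ptr<SketchReader>(new SketchSnoopReader());
    case SketchFormat::Unknown:
        break;
    }
    return nullptr;  // unknown format: the caller decides how to report it
}
```

In this sketch, `makeReaderFor(SketchFormat::Unknown)` yields a null pointer, and no device is constructed or opened just to probe the format.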
> WDYM, we are not parsing all formats? Did you mean "now"?

Yes, I meant "now", sorry for the typo 🤦

> IMO, how the file is processed after format detection is a separate concern. Device selection is handled in the `createReader` device factory, which allows looser coupling between the actual device classes and format detection (e.g. switching whether PcapNG creates a PcapDevice or a PcapNGDevice is as simple as swapping a `case` statement).
>
> I think integrating the functionality into `open()` would be suboptimal for the following reasons:

Having duplicate logic to determine if the file is of a certain format in both the device and CaptureFileFormatDetector is not great because if we fix a bug in one of them, we might miss the other. I think this logic should be in one place: either CaptureFileFormatDetector calls open() (might be the easiest option), or we can extract the detection logic and use it in both places.
Hmm, it should be possible. It will require expanding the CaptureFileFormatDetector a bit. Currently it only returns the format, but pcap for instance uses the magic number to also detect native or swapped byte order.
Depending on how specific we want to get it might involve a double read of the magic number, once by the format detector and once during the actual file header structure read. Impact should be minimal tho, as fstream is buffered by default.
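A minimal sketch of the double read mentioned above, assuming a seekable stream: the detector peeks at the 4-byte magic number, restores the stream position, and derives the byte order from the classic pcap magic `0xa1b2c3d4`. The names `SketchByteOrder`, `peekMagic` and `detectPcapByteOrder` are hypothetical and are not part of the PR.

```cpp
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical byte order result for this sketch.
enum class SketchByteOrder { Unknown, Native, Swapped };

// Reads the first 4 bytes as a host-order integer and rewinds the stream, so a
// later read of the full file header starts at the magic number again.
inline bool peekMagic(std::istream& content, std::uint32_t& magic)
{
    const std::istream::pos_type start = content.tellg();
    char raw[4] = {};
    content.read(raw, sizeof(raw));
    const bool complete = content.gcount() == 4;
    content.clear();       // a short read sets eofbit/failbit
    content.seekg(start);  // restore the position for the next reader
    if (complete)
        std::memcpy(&magic, raw, sizeof(magic));
    return complete;
}

// Classifies the byte order using the microsecond-precision pcap magic only.
inline SketchByteOrder detectPcapByteOrder(std::istream& content)
{
    constexpr std::uint32_t pcapMagic = 0xa1b2c3d4;
    constexpr std::uint32_t pcapMagicSwapped = 0xd4c3b2a1;

    std::uint32_t magic = 0;
    if (!peekMagic(content, magic))
        return SketchByteOrder::Unknown;
    if (magic == pcapMagic)
        return SketchByteOrder::Native;
    if (magic == pcapMagicSwapped)
        return SketchByteOrder::Swapped;
    return SketchByteOrder::Unknown;
}
```

Because the position is restored, the second read of the magic number during header parsing costs one more small read from the stream buffer rather than a change to the header-reading code.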
@seladb Tried a WIP implementation. It is possible to have open() call the format detector, tho I am not perfectly happy with the current iteration I have.
Can we do that merge of functionality in another PR, since those changes would also modify the PcapReader/Writer and SnoopReader and it goes out of scope of this PR?
PS: The WIP API would be something like this:
/// @brief An enumeration representing different capture file formats.
enum class CaptureFileFormat
{
    Unknown,
    Pcap,        // regular pcap with microsecond precision
    PcapMod,     // Alexey Kuznetzov's "modified" pcap format
    PcapNano,    // regular pcap with nanosecond precision
    PcapNG,      // uncompressed pcapng
    Snoop,       // solaris snoop
    ZstArchive,  // zstd compressed archive
};

/// @brief Specifies the byte order (endianness) of a capture file relative to the host system.
enum class CaptureFileByteOrder
{
    Unknown,  // Unknown format. Magic number is a palindrome.
    Native,   // Byte order is native to the host system.
    Swapped   // Byte order is swapped relative to the host system.
};

/// @brief Heuristic file format detector that scans the magic number of the file format header.
class CaptureFileFormatDetector
{
public:
    /// @brief Checks a content stream for the magic number and determines the type.
    ///
    /// The function optionally detects the byte order of the file if it can be determined by the magic number.
    /// The byte order is not updated if no supported format is detected.
    ///
    /// @param[in] content A stream that contains the file content.
    /// @param[out] byteOrder Optional location to store the detected byte order.
    /// @return A CaptureFileFormat value with the detected content type.
    CaptureFileFormat detectFormat(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;

    /// @brief Checks a content stream for the magic number and determines if it is a Pcap file.
    ///
    /// The function optionally detects the byte order of the file if it can be determined by the magic number.
    /// The byte order is not updated if no supported format is detected.
    ///
    /// @param[in] content A stream that contains the file content.
    /// @param[out] byteOrder Optional location to store the detected byte order.
    /// @return A CaptureFileFormat value with the detected Pcap format or Unknown if the file is not pcap.
    CaptureFileFormat detectPcapFile(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;

    /// @brief Checks a content stream for the magic number and determines if it is a PcapNG file.
    /// @param[in] content A stream that contains the file content.
    /// @return True if the content stream is a PcapNG file, false otherwise.
    bool isPcapNgFile(std::istream& content) const;

    /// @brief Checks a content stream for the magic number and determines if it is a Snoop file.
    /// @param[in] content A stream that contains the file content.
    /// @param[out] byteOrder Optional location to store the detected byte order.
    /// @return True if the content stream is a Snoop file, false otherwise.
    bool isSnoopFile(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;

    /// @brief Checks a content stream for the magic number and determines if it is a Zstd archive.
    ///
    /// The function optionally detects the byte order of the file if it can be determined by the magic number.
    /// The byte order is not updated if no supported format is detected.
    ///
    /// @param[in] content A stream that contains the file content.
    /// @param[out] byteOrder Optional location to store the detected byte order.
    /// @return True if the content stream is a Zstd archive, false otherwise.
    bool isZstdArchive(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;
};
I'm not sure we need CaptureFileFormatDetector if we call open() for each file type.
If we don't want to call open() we can extract the detection logic for each format in a static method, for example:
private:
    static bool PcapFileReaderDevice::isPcapFile(const std::ifstream& file, FileTimestampPrecision& precision, bool& needsSwap);

public:
    static bool PcapFileReaderDevice::isPcapFile(const std::ifstream& file)
    {
        return isPcapFile(file, ...);
    }

    bool PcapFileReaderDevice::open()
    {
        ...
        ... = isPcapFile(...);
        ...
    }
The PR adds heuristics based on the file content, which is more robust than deciding based on the file extension.
The new decision model scans the start of the file for its magic number signature. It then compares it to the signatures of supported file types [1] and constructs a reader instance based on the result.
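As a rough illustration of such a scan, the sketch below reads up to eight bytes from a seekable stream, compares them against a table of byte signatures, and restores the stream position. The type and function names (`SketchFormat`, `Signature`, `detectFormatSketch`) are hypothetical, and the table is an assumption about which signatures matter; it uses the published magic bytes for pcap (microsecond, nanosecond and "modified" variants in both byte orders), the pcapng Section Header Block type, the snoop identification pattern and the zstd frame magic. It is not the PR's implementation.

```cpp
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type for this sketch.
enum class SketchFormat { Unknown, Pcap, PcapNG, Snoop, Zstd };

struct Signature
{
    const char* bytes;   // raw signature bytes as they appear in the file
    std::size_t length;  // number of bytes to compare
    SketchFormat format;
};

inline SketchFormat detectFormatSketch(std::istream& content)
{
    static const Signature signatures[] = {
        { "\xa1\xb2\xc3\xd4", 4, SketchFormat::Pcap },    // pcap, microseconds
        { "\xd4\xc3\xb2\xa1", 4, SketchFormat::Pcap },    // same, other byte order
        { "\xa1\xb2\x3c\x4d", 4, SketchFormat::Pcap },    // pcap, nanoseconds
        { "\x4d\x3c\xb2\xa1", 4, SketchFormat::Pcap },    // same, other byte order
        { "\xa1\xb2\xcd\x34", 4, SketchFormat::Pcap },    // "modified" pcap
        { "\x34\xcd\xb2\xa1", 4, SketchFormat::Pcap },    // same, other byte order
        { "\x0a\x0d\x0d\x0a", 4, SketchFormat::PcapNG },  // Section Header Block type
        { "snoop\0\0\0", 8, SketchFormat::Snoop },        // snoop identification pattern
        { "\x28\xb5\x2f\xfd", 4, SketchFormat::Zstd },    // zstd frame magic
    };

    // Peek at the start of the content, then rewind for the actual reader.
    const std::istream::pos_type start = content.tellg();
    char head[8] = {};
    content.read(head, sizeof(head));
    const std::size_t available = static_cast<std::size_t>(content.gcount());
    content.clear();  // a short read sets eofbit/failbit
    content.seekg(start);

    for (const Signature& sig : signatures)
    {
        if (available >= sig.length && std::memcmp(head, sig.bytes, sig.length) == 0)
            return sig.format;
    }
    return SketchFormat::Unknown;
}
```

A caller could wrap a file's content in any `std::istream` (for example a `std::ifstream` opened in binary mode, or a `std::istringstream` in tests) and pass the result to a factory that picks the reader.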
New functions `createReader` and `tryCreateReader` have been added due to changes in the public API of the factory.
The functions differ in their error handling scheme: `createReader` throws and `tryCreateReader` returns `nullptr` on error.
Method behaviour changes during erroneous scenarios:
getReader createReader tryCreateReader nullptr PcapFileDeviceReader nullptr
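The two error handling schemes can be sketched as a pair of functions where the throwing variant wraps the non-throwing one. This layering is an assumption made for illustration; the PR may implement the two functions differently. `SketchReader`, `looksLikePcapNg`, `tryCreateReaderSketch` and `createReaderSketch` are hypothetical names, and the detection stub only recognises the pcapng block type to keep the example short.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical reader type; the real devices are far richer.
struct SketchReader
{
    std::string format;
};

// Detection stub for this sketch: recognises only the pcapng block type bytes.
inline bool looksLikePcapNg(const std::string& header)
{
    return header.compare(0, 4, "\x0a\x0d\x0d\x0a") == 0;
}

// Non-throwing variant: signals failure with a null pointer.
inline std::unique_ptr<SketchReader> tryCreateReaderSketch(const std::string& header)
{
    if (!looksLikePcapNg(header))
        return nullptr;
    return std::unique_ptr<SketchReader>(new SketchReader{ "pcapng" });
}

// Throwing variant: same logic, but failure becomes an exception.
inline std::unique_ptr<SketchReader> createReaderSketch(const std::string& header)
{
    std::unique_ptr<SketchReader> reader = tryCreateReaderSketch(header);
    if (!reader)
        throw std::runtime_error("unsupported or unrecognised capture file format");
    return reader;
}
```

Callers that treat an unknown file as a normal outcome can check the pointer from the `try` variant, while callers that consider it exceptional can use the throwing variant and skip the null check.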