GNU libextractor

Shawn Lin's picture

Introduction

GNU libextractor is GNU’s library for extracting meta data from files. Meta data includes format information (such as mime type, image dimensions, color depth, recording frequency), content descriptions (such as document title or document description) and copyright information (such as license, author and contributors). Currently, libextractor supports the following formats: HTML, PDF, PS, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC, MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL, RIFF (AVI), MPEG, QT and ASF. Also, various additional MIME types are detected.

Libextractor is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License. GNU libextractor uses plugins to handle various file formats. Technically a plugin can support multiple file formats; however, most plugins only support one particular format. By default, GNU libextractor will use all plugins that are available and found in the plugin installation directory. Applications can request the use of only specific plugins or the exclusion of certain plugins.

Example for using dynamic library

// hello.c
#include <extractor.h>
int main()
{
struct EXTRACTOR_PluginList *el;
el = EXTRACTOR_plugin_load_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
// ...
EXTRACTOR_plugin_remove_all (el);
return 0;
}

You can then compile the example using
$ gcc \ -I/Library/Frameworks/Extractor.framework/Versions/Current/include \
-o hello hello.c \
-L/Library/Frameworks/Extractor.framework/Versions/Current/lib \
-lextractor

Plugin management

C Struct: EXTRACTOR_PluginList

A plugin list represents a set of GNU libextractor plugins.

Function: void EXTRACTOR_plugin_remove_all (struct EXTRACTOR_PluginList *plugins)

Unload all of the plugins in the given list.

Function: struct EXTRACTOR_PluginList * EXTRACTOR_plugin_remove (struct EXTRACTOR_PluginList *plugins, const char*name)

Unloads a particular plugin. The given name should be the short name of the plugin, for example “mime” for the mime-type extractor or “mpeg” for the MPEG extractor.

Function: struct EXTRACTOR_PluginList
* EXTRACTOR_plugin_add (struct EXTRACTOR_PluginList *plugins, const char* name,const char* options, enum EXTRACTOR_Options flags)

Loads a particular plugin.

Function: struct EXTRACTOR_PluginList
* EXTRACTOR_plugin_add_config (struct EXTRACTOR_PluginList *plugins, const char* config, enum EXTRACTOR_Options flags)

Loads and unloads plugins based on a configuration string, modifying the existing list, which can be NULL. The string has the format “[-]NAME(OPTIONS){:[-]NAME(OPTIONS)}*”. Prefixing the plugin name with a “-” means that the plugin should be unloaded.

Function: struct EXTRACTOR_PluginList
* EXTRACTOR_plugin_add_defaults (enum EXTRACTOR_Options flags)

Loads all of the plugins in the plugin directory.

Example for a minimal extract method

The following example shows how a plugin can return the mime type of a file.

int
EXTRACTOR_mymime_extract
(const char *data,
size_t data_size,
EXTRACTOR_MetaDataProcessor proc,
void *proc_cls,
const char * options)
{
if (data_size < 4)
return 0;
if (0 != memcmp (data, "\177ELF", 4))
return 0;
if (0 != proc (proc_cls,
"mymime",
EXTRACTOR_METATYPE_MIMETYPE,
EXTRACTOR_METAFORMAT_UTF8,
"text/plain",
"application/x-executable",
1 + strlen("application/x-executable")))
return 1;
/* more calls to 'proc' here as needed */
return 0;
}

Internal utility functions

‘convert_numeric.h’ defines various conversion functions for numbers (in particular, byte-order conversion for floating point numbers).

‘unzip.h’ defines an API for accessing compressed files.

‘pack.h’ provides an interpreter for unpacking structs of integer numbers from streams and converting from big or little endian to host byte order at the same time.

‘convert.h’ provides a function for character set conversion described below.

Function: char * EXTRACTOR_common_convert_to_utf8(const char *input, size_t len, const char * charset)
which can be used to easily convert text from any character set to UTF-8.

Conclusion

In short, for us doing frequent message analyses, GNU libextractor not only helped us to find the correct file format, you can also write a variety of plugin tools, analysis of various text formats, find the corresponding MIME type. In terms of email, it can be said to be an indispensable tool.