Investigating Embedded Metadata

In this chapter, we will learn in detail about investigating embedded metadata using Python digital forensics.

Introduction

Embedded metadata is information about a digital asset that is stored inside the same file as the asset it describes. Because it lives in the file itself, it travels with the file whenever the file is copied or moved, and stays there unless it is deliberately stripped or altered.

In digital forensics, the file system alone cannot tell us everything about a particular file. Embedded metadata, on the other hand, can provide information critical to the investigation. For example, a text file's metadata may contain the author, the document length, the date it was written and even a short summary of the document. A digital image may include metadata such as the image dimensions, the shutter speed etc.

Artifacts Containing Metadata Attributes and their Extraction

In this section, we will learn about various artifacts containing metadata attributes and their extraction process using Python.

Audio and Video

Audio and video files are two very common artifacts that carry embedded metadata. This metadata can be extracted for the purpose of investigation.

You can use the following Python script to extract common attributes or metadata from an audio (MP3) file and a video (MP4) file.

Note that for this script, we need to install a third party Python library named mutagen, which allows us to extract metadata from audio and video files. It can be installed with the help of the following command −

pip install mutagen

Some of the useful libraries we need to import for this Python script are as follows −

from __future__ import print_function

import argparse
import json
import mutagen

The command line handler takes one argument, which represents the path to the MP3 or MP4 file. Then, we use the mutagen.File() function to open a handle to the file as follows −

if __name__ == '__main__':
   parser = argparse.ArgumentParser('Python Metadata Extractor')
   parser.add_argument("AV_FILE", help="File to extract metadata from")
   args = parser.parse_args()
   av_file = mutagen.File(args.AV_FILE)
   file_ext = args.AV_FILE.rsplit('.', 1)[-1]
   
   if file_ext.lower() == 'mp3':
      handle_id3(av_file)
   elif file_ext.lower() == 'mp4':
      handle_mp4(av_file)

Now, we need two handler functions, one to extract the data from an MP3 file and one to extract the data from an MP4 file. In the final script, they must be defined above the command line handler. We can define them as follows −

def handle_id3(id3_file):
   # Map ID3 frame IDs to human-readable names
   id3_frames = {
      'TIT2': 'Title', 'TPE1': 'Artist', 'TALB': 'Album',
      'TXXX': 'Custom', 'TCON': 'Content Type', 'TDRL': 'Date released',
      'COMM': 'Comments', 'TDRC': 'Recording Date'}
   print("{:15} | {:15} | {:38} | {}".format("Frame", "Description","Text","Value"))
   print("-" * 85)
   
   for frames in id3_file.tags.values():
      # Fall back to the raw frame ID when no friendly name is known
      frame_name = id3_frames.get(frames.FrameID, frames.FrameID)
      desc = getattr(frames, 'desc', "N/A")
      text = getattr(frames, 'text', ["N/A"])[0]
      value = getattr(frames, 'value', "N/A")
      
      # Date frames hold timestamp objects, so convert them before formatting
      if "date" in frame_name.lower():
         text = str(text)
      print("{:15} | {:15} | {:38} | {}".format(
         frame_name, desc, text, value))

def handle_mp4(mp4_file):
   # QuickTime tag names starting with the copyright symbol
   cp_sym = u"\u00A9"
   qt_tag = {
      cp_sym + 'nam': 'Title', cp_sym + 'art': 'Artist',
      cp_sym + 'alb': 'Album', cp_sym + 'gen': 'Genre',
      'cpil': 'Compilation', cp_sym + 'day': 'Creation Date',
      'cnID': 'Apple Store Content ID', 'atID': 'Album Title ID',
      'plID': 'Playlist ID', 'geID': 'Genre ID', 'pcst': 'Podcast',
      'purl': 'Podcast URL', 'egid': 'Episode Global ID',
      'cmID': 'Camera ID', 'sfID': 'Apple Store Country',
      'desc': 'Description', 'ldes': 'Long Description'}
   # Genre IDs are resolved through a separate JSON lookup file
   with open('apple_genres.json') as genre_file:
      genre_ids = json.load(genre_file)

Still inside handle_mp4(), we now iterate through the tags of the MP4 file as follows −

   print("{:22} | {}".format('Name', 'Value'))
   print("-" * 40)
   
   for name, value in mp4_file.tags.items():
      tag_name = qt_tag.get(name, name)
      
      # Tag values are usually lists, so flatten them into one string
      if isinstance(value, list):
         value = "; ".join([str(x) for x in value])
      if name == 'geID':
         value = "{}: {}".format(
            value, genre_ids[str(value)].replace("|", " - "))
      print("{:22} | {}".format(tag_name, value))

The above script prints the metadata tags embedded in MP3 as well as MP4 files.
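
To make the lookup-with-fallback pattern used in both handlers concrete, here is a minimal sketch that works on a plain dictionary instead of a real media file. The format_tags() helper, the sample tags and the one-entry genre table are all hypothetical. The real apple_genres.json file is not shown in this chapter, so its layout is only assumed here to map ID strings to pipe-separated genre names, as the replace("|", " - ") call above suggests:

```python
def format_tags(tags, friendly_names, genre_ids):
    """Return one 'Name | Value' line per tag, using friendly names where known."""
    lines = []
    for name, value in tags.items():
        # Fall back to the raw tag name when no friendly name is known.
        tag_name = friendly_names.get(name, name)
        # Tag values usually arrive as lists; flatten them into one string.
        if isinstance(value, list):
            value = "; ".join(str(x) for x in value)
        if name == 'geID':
            # Unknown genre IDs are reported rather than raising KeyError.
            genre = genre_ids.get(str(value), "Unknown")
            value = "{}: {}".format(value, genre.replace("|", " - "))
        lines.append("{:22} | {}".format(tag_name, value))
    return lines


# Hypothetical sample data standing in for mp4_file.tags and apple_genres.json.
sample_tags = {u"\u00A9nam": ["Demo Track"], "geID": [14], "xyz1": ["raw"]}
friendly = {u"\u00A9nam": "Title", "geID": "Genre ID"}
genres = {"14": "Music|Pop"}

for line in format_tags(sample_tags, friendly, genres):
    print(line)
```

Returning the lines instead of printing them directly keeps the formatting logic easy to check against known input, which is useful when the output ends up in a report.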

Images

Images may contain different kinds of metadata depending on the file format. Many photos, particularly those taken with smartphones and GPS-enabled cameras, embed GPS information. We can extract this GPS information by using third party Python libraries. The following Python script can be used to do the same −

First, install Pillow, the maintained fork of the Python Imaging Library (PIL), as follows −

pip install pillow

This will help us to extract metadata from images.

We can also write the GPS details embedded in images to a KML file, but for this we need to install a third party Python library named simplekml as follows −

pip install simplekml

In this script, first we need to import the following libraries −

from __future__ import print_function
import argparse

from PIL import Image
from PIL.ExifTags import TAGS

import simplekml
import sys

Now, the command line handler will accept one positional argument which basically represents the file path of the photos.

parser = argparse.ArgumentParser('Metadata from images')
parser.add_argument('PICTURE_FILE', help = "Path to picture")
args = parser.parse_args()

Now, we need to specify the URL templates that will be populated with the coordinate information. They are named gmaps and open_maps. We also need a function to convert the degrees, minutes, seconds (DMS) tuple provided by the PIL library into decimal degrees. It can be done as follows −

gmaps = "https://www.google.com/maps?q={},{}"
open_maps = "http://www.openstreetmap.org/?mlat={}&mlon={}"

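def process_coords(coord):
   coord_deg = 0
   
   for count, value in enumerate(coord):
      # Older Pillow releases return (numerator, denominator) tuples,
      # newer ones return rational numbers that convert to float directly
      if isinstance(value, tuple):
         value = float(value[0]) / value[1]
      # Degrees are divided by 1, minutes by 60 and seconds by 3600
      coord_deg += float(value) / 60**count
   return coord_deg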
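
As a worked example of the conversion, the following sketch applies the same arithmetic to hard-coded values. The dms_to_decimal() name and the sample coordinates are hypothetical, and the hemisphere handling is folded into the function here for convenience, whereas the script in this chapter applies the sign in a separate step:

```python
def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus a hemisphere letter to decimal degrees."""
    decimal = 0.0
    for count, part in enumerate(dms):
        # Accept either (numerator, denominator) tuples or plain numbers.
        if isinstance(part, tuple):
            part = part[0] / part[1]
        # Index 0 is degrees, 1 is minutes (1/60), 2 is seconds (1/3600).
        decimal += float(part) / 60 ** count
    # Southern latitudes and western longitudes are negative.
    if ref in ("S", "W"):
        decimal = -decimal
    return decimal


# Hypothetical coordinates: 40 deg 26' 46" N, 79 deg 58' 56" W
lat = dms_to_decimal(((40, 1), (26, 1), (46, 1)), "N")
lon = dms_to_decimal((79, 58, 56), "W")
print(round(lat, 4), round(lon, 4))
```

For the latitude, the sum is 40 + 26/60 + 46/3600, which is roughly 40.4461 degrees.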

Now, we will use the Image.open() function to open the file as a PIL object and read its EXIF data.

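img_file = Image.open(args.PICTURE_FILE)
exif_data = img_file._getexif()

if exif_data is None:
   print("No EXIF data found")
   sys.exit()

for name, value in exif_data.items():
   gps_tag = TAGS.get(name, name)
   if gps_tag == 'GPSInfo':
      break   # value now holds the GPS dictionary
else:
   # The loop finished without finding a GPSInfo tag
   print("No GPS data found")
   sys.exit()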

After finding the GPSInfo tag, we store the GPS references (keys 1 and 3) and process the latitude and longitude (keys 2 and 4) with the process_coords() function.

lat_ref = value[1] == u'N'
lat = process_coords(value[2])

if not lat_ref:
   lat = lat * -1
lon_ref = value[3] == u'E'
lon = process_coords(value[4])

if not lon_ref:
   lon = lon * -1

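Now, create a Kml object from the simplekml library, add a point for the photo and save it as follows. Note that the coordinates are given in longitude, latitude order −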

kml = simplekml.Kml()
kml.newpoint(name = args.PICTURE_FILE, coords = [(lon, lat)])
kml.save(args.PICTURE_FILE + ".kml")
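
To show what such a file contains, here is a minimal sketch that builds a one-placemark KML document with the standard library only. It is not a reproduction of what simplekml writes, which includes more elements. The build_kml() helper and the sample values are hypothetical:

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(name, lat, lon):
    """Return a minimal KML document (as a string) holding a single placemark."""
    # Register the KML namespace as the default so tags are written unprefixed.
    ET.register_namespace("", KML_NS)
    kml = ET.Element("{%s}kml" % KML_NS)
    doc = ET.SubElement(kml, "{%s}Document" % KML_NS)
    placemark = ET.SubElement(doc, "{%s}Placemark" % KML_NS)
    ET.SubElement(placemark, "{%s}name" % KML_NS).text = name
    point = ET.SubElement(placemark, "{%s}Point" % KML_NS)
    # KML stores coordinates as longitude,latitude.
    ET.SubElement(point, "{%s}coordinates" % KML_NS).text = "{},{}".format(lon, lat)
    return ET.tostring(kml, encoding="unicode")


# Hypothetical photo name and coordinates.
print(build_kml("photo.jpg", 40.4461, -79.9822))
```

The longitude-first order inside the coordinates element is the reason the point above is created with (lon, lat) rather than (lat, lon).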

We can now print the coordinates from processed information as follows −

print("GPS Coordinates: {}, {}".format(lat, lon))
print("Google Maps URL: {}".format(gmaps.format(lat, lon)))
print("OpenStreetMap URL: {}".format(open_maps.format(lat, lon)))
print("KML File {} created".format(args.PICTURE_FILE + ".kml"))

PDF Documents

PDF documents can hold a wide variety of media, including images, text and forms. The embedded metadata of a PDF document is often stored in a format called Extensible Metadata Platform (XMP). We can extract this metadata with the help of the following Python code −

First, install a third party Python library named PyPDF2 to read metadata stored in XMP format. The PdfFileReader class and getXmpMetadata() method used below belong to the older PyPDF2 API, which was removed in PyPDF2 3.0, so a release before 3.0 is assumed here. It can be installed as follows −

pip install "PyPDF2<3"

Now, import the following libraries for extracting the metadata from PDF files −

from __future__ import print_function
from argparse import ArgumentParser, FileType

import datetime
from PyPDF2 import PdfFileReader
import sys

Now, the command line handler will accept one positional argument which basically represents the file path of the PDF file.

parser = ArgumentParser('Metadata from PDF')
parser.add_argument('PDF_FILE', help='Path to PDF file',type=FileType('rb'))
args = parser.parse_args()

Now we can use getXmpMetadata() method to provide an object containing the available metadata as follows −

pdf_file = PdfFileReader(args.PDF_FILE)
xmpm = pdf_file.getXmpMetadata()

if xmpm is None:
   print("No XMP metadata found in document.")
   sys.exit()

We can use a helper function named custom_print() to print the relevant values like title, creator, contributor etc. as follows −

custom_print("Title: {}", xmpm.dc_title)
custom_print("Creator(s): {}", xmpm.dc_creator)
custom_print("Contributors: {}", xmpm.dc_contributor)
custom_print("Subject: {}", xmpm.dc_subject)
custom_print("Description: {}", xmpm.dc_description)
custom_print("Created: {}", xmpm.xmp_createDate)
custom_print("Modified: {}", xmpm.xmp_modifyDate)
custom_print("Event Dates: {}", xmpm.dc_date)

Because PDFs created by different software may store these values as different types, custom_print() is defined to handle each type as follows. In the final script, this definition must appear before the calls shown above −

def custom_print(fmt_str, value):
   if isinstance(value, list):
      print(fmt_str.format(", ".join(value)))
   elif isinstance(value, dict):
      # Join each key and value, then print the joined pairs
      fmt_value = [":".join((k, v)) for k, v in value.items()]
      print(fmt_str.format(", ".join(fmt_value)))
   elif isinstance(value, str) or isinstance(value, bool):
      print(fmt_str.format(value))
   elif isinstance(value, bytes):
      print(fmt_str.format(value.decode()))
   elif isinstance(value, datetime.datetime):
      print(fmt_str.format(value.isoformat()))
   elif value is None:
      print(fmt_str.format("N/A"))
   else:
      print("warn: unhandled type {} found".format(type(value)))
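
The type-by-type handling can be tried out without a PDF at hand. The sketch below is a variant that returns the text instead of printing it, so the result for each type is easy to inspect. The format_value() name and the sample values are hypothetical, and which Python type a given XMP field actually arrives as depends on the library version, so the pairing of fields and types here is only an assumption:

```python
import datetime


def format_value(value):
    """Render a metadata value of any common type as a printable string."""
    if value is None:
        return "N/A"
    if isinstance(value, bytes):
        # Undecodable bytes are replaced rather than raising an error.
        return value.decode(errors="replace")
    if isinstance(value, datetime.datetime):
        return value.isoformat()
    if isinstance(value, dict):
        return ", ".join("{}: {}".format(k, v) for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return ", ".join(str(item) for item in value)
    return str(value)


# Hypothetical values of the kinds an XMP object might return.
samples = {
    "Title": {"x-default": "Quarterly Report"},
    "Creator(s)": ["Alice", "Bob"],
    "Created": datetime.datetime(2020, 1, 2, 3, 4, 5),
    "Description": None,
}
for label, value in samples.items():
    print("{}: {}".format(label, format_value(value)))
```

Converting every list item with str() means a list that contains non-string items does not break the join, which is one case the printing version above does not cover.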

We can also extract any other custom property saved by the software as follows −

if xmpm.custom_properties:
   print("Custom Properties:")
   
   for k, v in xmpm.custom_properties.items():
      print("\t{}: {}".format(k, v))

The above script reads the PDF document and prints the metadata stored in XMP format, including any custom properties stored by the software that created the PDF.

Windows Executable Files

Sometimes we may encounter a suspicious or unauthorized executable file. Its embedded metadata can be useful for the investigation: it may reveal information such as the file's location, its purpose and other attributes such as the manufacturer, compilation date etc. With the help of the following Python script, we can get the compilation date, useful data from the headers, and the imported as well as exported symbols.

For this purpose, first install the third party Python library pefile. It can be done as follows −

pip install pefile

Once you have installed it successfully, import the following libraries −

from __future__ import print_function

import argparse
from datetime import datetime
from pefile import PE

Now, the command line handler will accept one positional argument which basically represents the file path of the executable file. You can also choose the style of output, whether you need it in detailed and verbose way or in a simplified manner. For this you need to give an optional argument as shown below −

parser = argparse.ArgumentParser('Metadata from executable file')
parser.add_argument("EXE_FILE", help = "Path to exe file")
parser.add_argument("-v", "--verbose", help = "Increase verbosity of output",
   action = 'store_true', default = False)
args = parser.parse_args()

Now, we will load the input executable file by using PE class. We will also dump the executable data to a dictionary object by using dump_dict() method.

pe = PE(args.EXE_FILE)
ped = pe.dump_dict()

We can extract basic file metadata such as embedded authorship, version and compilation time using the code shown below −

file_info = {}
# Newer pefile releases wrap the structures in an extra list, so handle both layouts
for entry in getattr(pe, 'FileInfo', []):
   structures = entry if isinstance(entry, list) else [entry]
   for structure in structures:
      if structure.Key != b'StringFileInfo':
         continue
      for s_table in structure.StringTable:
         for key, value in s_table.entries.items():
            if value is None or len(value) == 0:
               value = "Unknown"
            file_info[key] = value

print("File Information: ")
print("==================")

for k, v in file_info.items():
   if isinstance(k, bytes):
      k = k.decode()
   if isinstance(v, bytes):
      v = v.decode()
   print("{}: {}".format(k, v))

# Keep only the date between the square brackets of the dumped value
comp_time = ped['FILE_HEADER']['TimeDateStamp']['Value']
comp_time = comp_time.split("[")[-1].strip("]")
time_stamp, timezone = comp_time.rsplit(" ", 1)
comp_time = datetime.strptime(time_stamp, "%a %b %d %H:%M:%S %Y")
print("Compiled on {} {}".format(comp_time, timezone.strip()))
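
The string handling for the compilation time is easier to follow on a concrete value. The sketch below parses a dump-style string and, as an alternative, converts the raw number of seconds directly. The sample value is hypothetical, and the exact layout of the dumped string (hex value followed by a bracketed date ending in a time zone) is assumed from the parsing code above rather than taken from the pefile documentation. In pefile itself, the raw integer should be available as pe.FILE_HEADER.TimeDateStamp:

```python
import time
from datetime import datetime, timezone


def parse_dump_timestamp(value):
    """Parse the bracketed date from a string like '0x... [Fri Jul 14 02:40:00 2017 UTC]'."""
    inner = value.split("[")[-1].strip("]")
    # Split off the trailing time zone name before parsing the date.
    stamp, _zone = inner.rsplit(" ", 1)
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


def from_raw_timestamp(seconds):
    """Convert the raw header value (seconds since 1970-01-01 UTC) to a naive UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(tzinfo=None)


# Hypothetical raw value; the string is built to mimic the assumed dump layout.
raw = 1500000000
sample = "0x{:X} [{} UTC]".format(raw, time.asctime(time.gmtime(raw)))
print(sample)
print(parse_dump_timestamp(sample), from_raw_timestamp(raw))
```

Both routes give the same moment in time. Working from the raw integer avoids depending on the text layout of the dump. Keep in mind that the compile timestamp is written by the build tools and can be altered, so it is best treated as one piece of evidence among several.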

We can extract the useful data from headers as follows −

for section in ped['PE Sections']:
   print("Section '{}' at {}: {}/{} {}".format(
      section['Name']['Value'], hex(section['VirtualAddress']['Value']),
      section['Misc_VirtualSize']['Value'],
      section['SizeOfRawData']['Value'], section['MD5'])
   )

Now, extract the listing of imports and exports from executable files as shown below −

if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT'):
   print("\nImports: ")
   print("=========")
   
   for dir_entry in pe.DIRECTORY_ENTRY_IMPORT:
      dll = dir_entry.dll
      
      if not args.verbose:
         print(dll.decode(), end=", ")
         continue
      name_list = []
      
      for impts in dir_entry.imports:
         # Imports by ordinal have no name, so fall back to a placeholder
         name = getattr(impts, "name", None) or b"Unknown"
         name_list.append([name.decode(), hex(impts.address)])
      name_fmt = ["{} ({})".format(x[0], x[1]) for x in name_list]
      print('- {}: {}'.format(dll.decode(), ", ".join(name_fmt)))
   if not args.verbose:
      print()

Now, print exports, names and addresses using the code as shown below −

if hasattr(pe, 'DIRECTORY_ENTRY_EXPORT'):
   print("\nExports: ")
   print("=========")
   
   for sym in pe.DIRECTORY_ENTRY_EXPORT.symbols:
      # Symbols exported by ordinal only have no name
      name = sym.name.decode() if sym.name else "Unknown"
      print('- {}: {}'.format(name, hex(sym.address)))

The above script extracts the basic metadata and header information from Windows executable files.

Office Document Metadata

Much everyday office work is done in three MS Office applications: Word, PowerPoint and Excel. Files produced by these applications carry a large amount of metadata, which can expose interesting information about their authorship and history.

Note that metadata in the 2007 formats of Word (.docx), Excel (.xlsx) and PowerPoint (.pptx) is stored in XML files inside a ZIP container. We can process these XML files with the help of the Python script shown below −

First, import the required libraries and define the command line handler as shown below −

from __future__ import print_function
from argparse import ArgumentParser
from datetime import datetime as dt
from xml.etree import ElementTree as etree

import zipfile

parser = ArgumentParser('Office Document Metadata')
parser.add_argument("Office_File", help="Path to office file to read")
args = parser.parse_args()

Now, check if the file is a ZIP file and raise an error if it is not. Then, open the file and extract the key elements for processing using the following code −

if not zipfile.is_zipfile(args.Office_File):
   raise ValueError("{} is not a ZIP-based Office file".format(args.Office_File))
zfile = zipfile.ZipFile(args.Office_File)
core_xml = etree.fromstring(zfile.read('docProps/core.xml'))
app_xml = etree.fromstring(zfile.read('docProps/app.xml'))

Now, create a dictionary for initiating the extraction of the metadata −

core_mapping = {
   'title': 'Title',
   'subject': 'Subject',
   'creator': 'Author(s)',
   'keywords': 'Keywords',
   'description': 'Description',
   'lastModifiedBy': 'Last Modified By',
   'modified': 'Modified Date',
   'created': 'Created Date',
   'category': 'Category',
   'contentStatus': 'Status',
   'revision': 'Revision'
}

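Iterate over the root element to access each of the tags within the XML file. The getchildren() method is no longer available in recent Python 3 releases, so we loop over the element directly −

for element in core_xml: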
   for key, title in core_mapping.items():
      if key in element.tag:
         if 'date' in title.lower():
            text = dt.strptime(element.text, "%Y-%m-%dT%H:%M:%SZ")
         else:
            text = element.text
         print("{}: {}".format(title, text))
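
The whole read path can be tried without a real Office file. The following sketch builds a tiny stand-in ZIP archive in memory that contains only docProps/core.xml, then reads the properties back. The read_core_properties() helper and the sample property values are hypothetical, and a real document stores more elements and attributes than this stand-in:

```python
import io
import zipfile
from xml.etree import ElementTree as etree

# A trimmed-down core.xml with hypothetical values.
CORE_XML = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:creator>Jane Doe</dc:creator>'
    '<cp:lastModifiedBy>John Roe</cp:lastModifiedBy>'
    '<dcterms:created>2020-01-02T03:04:05Z</dcterms:created>'
    '</cp:coreProperties>'
)


def read_core_properties(data):
    """Return {local tag name: text} for docProps/core.xml in a ZIP held in bytes."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        raise ValueError("not a ZIP container")
    with zipfile.ZipFile(buffer) as zfile:
        root = etree.fromstring(zfile.read('docProps/core.xml'))
    # Tags look like '{namespace}creator'; keep only the part after '}'.
    return {child.tag.split('}')[-1]: child.text for child in root}


# Build the stand-in container in memory instead of reading a file from disk.
container = io.BytesIO()
with zipfile.ZipFile(container, 'w') as zf:
    zf.writestr('docProps/core.xml', CORE_XML)

props = read_core_properties(container.getvalue())
print(props)
```

Stripping the namespace and comparing the local name exactly is a little stricter than the substring test used in the loop above, at the cost of one extra string operation per element.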

Similarly, do this for app.xml file which contains statistical information about the contents of the document −

app_mapping = {
   'TotalTime': 'Edit Time (minutes)',
   'Pages': 'Page Count',
   'Words': 'Word Count',
   'Characters': 'Character Count',
   'Lines': 'Line Count',
   'Paragraphs': 'Paragraph Count',
   'Company': 'Company',
   'HyperlinkBase': 'Hyperlink Base',
   'Slides': 'Slide count',
   'Notes': 'Note Count',
   'HiddenSlides': 'Hidden Slide Count',
}

for element in app_xml:
   for key, title in app_mapping.items():
      if key in element.tag:
         if 'date' in title.lower():
            text = dt.strptime(element.text, "%Y-%m-%dT%H:%M:%SZ")
         else:
            text = element.text
         print("{}: {}".format(title, text))

After running the above script, we get various details about the given document. Note that this script works only on documents in the Office 2007 or later formats.
