Skip to main content

Lists the files on a drive insanely fast (43 seconds for 1,800,000 files - 600 GB) by converting the $MFT to a pandas DataFrame

Project description

Lists the files on a drive insanely fast (43 seconds for 1,800,000 files - 600 GB) by converting the $MFT to a pandas DataFrame

pip install mft2df

Tested against Windows 10 / Python 3.10 / Anaconda

The list_files_from_drive function can be used by individuals or developers who need to retrieve a list of files from a specified drive. It can be particularly useful for tasks such as file system analysis, data exploration, or building file management utilities.

Advantages of list_files_from_drive:

  • Retrieves file information from the specified drive and returns it as a structured pandas DataFrame, allowing for easy data manipulation and analysis.
  • Supports parsing of the Master File Table (MFT) dump using external utilities (mft.exe https://github.com/makitos666/MFT_Fast_Transcoder -to copy the mft- and mft_dump.exe https://github.com/omerbenamram/mft -to parse the mft-) to extract file metadata.
  • Uses subprocess calls to execute external commands in a hidden window, providing a seamless user experience.
  • Parses the output of the MFT dump into a DataFrame using pandas, enabling efficient data handling and processing.
  • Performs data type conversions and date parsing for specific columns, ensuring data consistency and usability.
  • Filters out rows with missing FullPath values to ensure the integrity of the data.
  • Prepends the drive letter to the FullPath column to create a complete file path.
  • Cleans up the temporary MFT dump file after processing.
  • Utilizes efficient memory management by explicitly deleting variables, garbage collection, and low-memory options in the pandas read_csv function.
Args:
    drive (str): The drive letter to retrieve the files from. Default is "c".
    convert_dates (bool): Whether to use pd.to_datetime to convert "FileNameLastModified", "FileNameLastAccess",
                           "FileNameCreated","StandardInfoLastModified","StandardInfoLastAccess","StandardInfoCreated"
                          (Parsing takes about 2x longer, and the resulting DataFrame is about 30% bigger)
Returns:
    pd.DataFrame: A DataFrame containing the list of files retrieved from the drive.
Raises:
    None
    
# Important: you need admin rights!!!!
from mft2df import list_files_from_drive
from time import perf_counter
start = perf_counter()
df=list_files_from_drive(drive= "c")
print(f'Time needed: {perf_counter() - start} for {len(df)} files')
print(df[200060:200066].to_string())

# Time needed: 43.62916430000041 for 1842450 files

#        Signature  EntryId  Sequence  BaseEntryId  BaseEntrySequence  HardLinkCount      Flags  UsedEntrySize  TotalEntrySize  FileSize  IsADirectory  IsDeleted  HasAlternateDataStreams StandardInfoFlags     StandardInfoLastModified       StandardInfoLastAccess          StandardInfoCreated           FileNameFlags         FileNameLastModified           FileNameLastAccess              FileNameCreated                                                                                                                     FullPath
# 200060      FILE   202514         1            0                  0              2  ALLOCATED            672            1024       211         False      False                    False           (empty)  2020-03-04T10:38:59.012552Z  2020-03-04T10:38:59.012552Z  2020-03-04T10:39:00.779040Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.012552Z  2020-03-04T10:38:59.012552Z  2020-03-04T10:38:59.012552Z  c:\Windows\WinSxS\Manifests\amd64_bthmtpenum.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_710d1caf8aa9bb19.manifest
# 200061      FILE   202515         1            0                  0              2  ALLOCATED            664            1024       208         False      False                    False           (empty)  2020-03-04T10:38:59.022586Z  2020-03-04T10:38:59.022586Z  2020-03-04T10:39:00.779040Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.022586Z  2020-03-04T10:38:59.022586Z  2020-03-04T10:38:59.022586Z       c:\Windows\WinSxS\Manifests\amd64_c_wpd.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_a4c4bcf7ec41f07e.manifest
# 200062      FILE   202516         1            0                  0              2  ALLOCATED            672            1024       207         False      False                    False           (empty)  2020-03-04T10:38:59.032170Z  2020-03-04T10:38:59.032170Z  2020-03-04T10:39:00.779040Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.032170Z  2020-03-04T10:38:59.032170Z  2020-03-04T10:38:59.022586Z     c:\Windows\WinSxS\Manifests\amd64_wpdcomp.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_78d37c0df7225559.manifest
# 200063      FILE   202517         1            0                  0              2  ALLOCATED            664            1024       207         False      False                    False           (empty)  2020-03-04T10:38:59.032699Z  2020-03-04T10:38:59.032699Z  2020-03-04T10:39:00.794664Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.032699Z  2020-03-04T10:38:59.032699Z  2020-03-04T10:38:59.032699Z       c:\Windows\WinSxS\Manifests\amd64_wpdfs.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_a09f098927b0c6b9.manifest
# 200064      FILE   202518         1            0                  0              2  ALLOCATED            664            1024       208         False      False                    False           (empty)  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z  2020-03-04T10:39:00.794664Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.032699Z      c:\Windows\WinSxS\Manifests\amd64_wpdmtp.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_13d74fb245acf719.manifest
# 200065      FILE   202519         1            0                  0              2  ALLOCATED            672            1024       211         False      False                    False           (empty)  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z  2020-03-04T10:39:00.794664Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z    c:\Windows\WinSxS\Manifests\amd64_wpdmtphw.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_52e461d8f91111b2.manifest

Examples

Finds all python files on your HDD that contain the string "ctypes" in less than 2 minutes

import pandas as pd
from PrettyColorPrinter import add_printer # pip install PrettyColorPrinter
add_printer(1)
from mft2df import list_files_from_drive
from time import perf_counter

start = perf_counter()
df = list_files_from_drive(drive="c", convert_dates=False)
print(f"Time needed: {perf_counter() - start} " f"for {len(df)} files")


def get_content(file):
    try:
        with open(file, mode="r", encoding="utf-8") as f:
            data = f.read()

    except Exception:
        data = pd.NA
    return data


dffi = df.loc[
    (df.FullPath.str.endswith(".py")) & (~df.IsDeleted) & (~df.IsADirectory)
].copy()
dffi["FileContent"] = dffi.FullPath.apply(get_content)
dffi = dffi.loc[~dffi["FileContent"].isna()]
ctypesfiles = dffi.loc[dffi.FileContent.str.contains("ctypes")]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mft2df-0.13.tar.gz (464.3 kB view details)

Uploaded Source

Built Distribution

mft2df-0.13-py3-none-any.whl (464.5 kB view details)

Uploaded Python 3

File details

Details for the file mft2df-0.13.tar.gz.

File metadata

  • Download URL: mft2df-0.13.tar.gz
  • Upload date:
  • Size: 464.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.10

File hashes

Hashes for mft2df-0.13.tar.gz
Algorithm Hash digest
SHA256 488e970dae2835f08150c164ed4d978db5e8e79a06367fb1582245f1a2cbd205
MD5 889157131ee77c7a96c14908896d949e
BLAKE2b-256 9777eca58cbb2f4ed12a10c3f621b6eab412e977a48c03f9c9d09fef47f358d1

See more details on using hashes here.

File details

Details for the file mft2df-0.13-py3-none-any.whl.

File metadata

  • Download URL: mft2df-0.13-py3-none-any.whl
  • Upload date:
  • Size: 464.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.10

File hashes

Hashes for mft2df-0.13-py3-none-any.whl
Algorithm Hash digest
SHA256 b269347a0a48c20fd8acb1fdab9c1c28e8c241e7a85d4a8c2ea20d861b0cb4b4
MD5 05ca86175fa6851b99949537900e4637
BLAKE2b-256 7f6302b0244712b114ab267d7a90500526f8c75d9818e57e300312e0809359ef

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page