Get duplicate files in all SharePoint sites using file hash

Everything is explained in this video: https://youtu.be/WHk2tIav-sQ

The task was to identify all duplicate files across all SharePoint sites. I searched online for a solution, but none of the scripts I found were accurate. They detected duplicates by criteria such as name, modified date, or size, none of which proves that two files actually have the same content.

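After extensive research, I developed the following script, which generates a hash for each file on the SharePoint sites. A hash is a fingerprint computed from the file's content: files with identical content always produce the same hash, while files that differ by even a single character produce different hashes. The script uses MD5, for which an accidental match between two different files is extremely unlikely, although deliberately crafted collisions are possible.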
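To make the idea concrete, here is a minimal sketch that runs locally without any SharePoint connection. The helper name Get-Md5Hex and the sample strings are made up for illustration and are not part of the script below; the hash is formatted with BitConverter in the same way the script does.

```powershell
# Hypothetical helper (not part of the original script): returns the MD5 hash
# of a byte array as a dash-separated hex string, like the script's HashCode.
function Get-Md5Hex {
    param([byte[]]$Bytes)
    $md5 = [System.Security.Cryptography.MD5]::Create()
    try {
        return [System.BitConverter]::ToString($md5.ComputeHash($Bytes))
    }
    finally {
        $md5.Dispose()
    }
}

# Made-up sample content standing in for three files.
$utf8  = [System.Text.Encoding]::UTF8
$hashA = Get-Md5Hex -Bytes $utf8.GetBytes("Quarterly report, final version")
$hashB = Get-Md5Hex -Bytes $utf8.GetBytes("Quarterly report, final version")
$hashC = Get-Md5Hex -Bytes $utf8.GetBytes("Quarterly report, final version.")

# Identical content gives identical hashes; one extra character changes the hash.
Write-Host "A equals B: $($hashA -eq $hashB)"
Write-Host "A equals C: $($hashA -eq $hashC)"
```

The first comparison should report True and the second False, which is why grouping files by hash finds true duplicates regardless of their names or modified dates.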

If you want to do the same for only one SharePoint site, see this post: Get duplicate files in SharePoint site using file HASH. (itfreesupport.com)

I hope this script will be useful for many people.

Register a new Azure AD Application and Grant Access to the tenant

Register-PnPManagementShellAccess

Then paste and run the PnP PowerShell script below:

#Parameters
$TenantURL = "https://tenant-admin.sharepoint.com"
$Pagesize = 2000
$ReportOutput = "C:\Temp\DupSitename.csv"

#Connect to SharePoint Online tenant
Connect-PnPOnline $TenantURL -Interactive
Connect-SPOService -Url $TenantURL

#Array to store results
$DataCollection = @()

#Get all site collections
$SiteCollections = Get-SPOSite -Limit All -Filter "Url -like '/sites/'"

#Iterate through each site collection

ForEach ($Site in $SiteCollections)
{
#Get the site URL
$SiteURL = $Site.Url

#Connect to SharePoint Online site
Connect-PnPOnline $SiteURL -Interactive

#Get all Document libraries
$DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin ("Site Pages","Style Library", "Preservation Hold Library")}

#Iterate through each document library
ForEach ($Library in $DocumentLibraries)
{
    #Get All documents from the library
    $global:counter = 0;
    $Documents = Get-PnPListItem -List $Library -PageSize $Pagesize -Fields ID, File_x0020_Type -ScriptBlock `
    { Param ($items) $global:counter += $items.Count; Write-Progress -PercentComplete ($global:Counter / ($Library.ItemCount) * 100) -Activity `
    "Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:Counter of $($Library.ItemCount)";} | Where {$_.FileSystemObjectType -eq "File"}
    $ItemCounter = 0

    #Iterate through each document
    Foreach ($Document in $Documents)
    {
        #Get the File from Item
        $File = Get-PnPProperty -ClientObject $Document -Property File

        #Get the file hash (MD5 of the file content)
        $Bytes = $File.OpenBinaryStream()
        Invoke-PnPQuery
        $MD5 = [System.Security.Cryptography.MD5]::Create()
        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))
        #Release the hash object and the downloaded stream
        $MD5.Dispose()
        $Bytes.Value.Dispose()

        #Collect data
        $Data = New-Object PSObject
        $Data | Add-Member -MemberType NoteProperty -name "FileName" -value $File.Name
        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
        $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
        $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -value $File.Length
        $DataCollection += $Data
        $ItemCounter++
        Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" `
        -Status "Reading Data from Document '$($Document['FileLeafRef'])' at '$($Document['FileRef'])'"
    }

}
}

#Get duplicate files by grouping on hash code
$Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
Write-Host "Duplicate Files Based on File Hashcode:"
$Duplicates | Format-Table -AutoSize

#Export the duplicate results to CSV

$Duplicates | Export-Csv -Path $ReportOutput -NoTypeInformation
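
The grouping step can be tried on its own with a few hand-made rows before running the full script against a tenant. This is only a sketch: the file names, hash values, and URLs below are invented sample data with the same shape as the entries in $DataCollection.

```powershell
# Invented sample rows shaped like the script's $DataCollection entries.
$Sample = @(
    [PSCustomObject]@{ FileName = "Budget.xlsx";      HashCode = "AA-11"; URL = "/sites/finance/Shared Documents/Budget.xlsx" }
    [PSCustomObject]@{ FileName = "Budget-copy.xlsx"; HashCode = "AA-11"; URL = "/sites/hr/Shared Documents/Budget-copy.xlsx" }
    [PSCustomObject]@{ FileName = "Notes.docx";       HashCode = "BB-22"; URL = "/sites/hr/Shared Documents/Notes.docx" }
)

# Same pipeline as the script: keep only hash groups with more than one file.
$Dupes = $Sample |
    Group-Object -Property HashCode |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group

# One line per duplicated hash, listing every location where that content lives.
$Summary = $Sample |
    Group-Object -Property HashCode |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { "{0}: {1}" -f $_.Name, ($_.Group.URL -join "; ") }

$Dupes | Format-Table -AutoSize
$Summary
```

Here the two Budget files share a hash and are reported as duplicates even though their names and sites differ, while Notes.docx is left out because its hash is unique.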
