Who's Online
89 visitors online now
2 guests, 87 bots, 0 members
Support my Sponsor
  • An error has occurred, which probably means the feed is down. Try again later.

Get duplicate files in SharePoint site using file HASH

Explained everything in the Video : https://youtu.be/WHk2tIav-sQ

The task was to find a way to identify all duplicate files in a SharePoint site. I searched online for a solution, but none of the scripts I found were accurate. They used different criteria to detect duplicates, such as Name, modified date, size, etc.

After extensive research, I developed the following script that can generate a hash for each file on the SharePoint sites. The hash is a unique identifier that cannot be the same for two files, even if they differ by a single character.

If you want to do the same for all SharePoint sites, you can use below link:

Get duplicate files in all SharePoint site using file HASH. (itfreesupport.com)

I hope this script will be useful for many people.

Register a new Azure AD Application and Grant Access to the tenant

Register-PnPManagementShellAccess

Then paste and run below pnp script:

$SiteURL = “https://tenant.sharepoint.com/sites/sitename”
$Pagesize = 2000
$ReportOutput = “C:\Temp\DupSitename.csv”

Connect to SharePoint Online site

Connect-PnPOnline $SiteURL -Interactive

Array to store results

$DataCollection = @()

Get all Document libraries

$DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq “DocumentLibrary” -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin (“Site Pages”,”Style Library”, “Preservation Hold Library”)}

Iterate through each document library

ForEach ($Library in $DocumentLibraries)
{
#Get All documents from the library
$global:counter = 0;
$Documents = Get-PnPListItem -List $Library -PageSize $Pagesize -Fields ID, File_x0020_Type -ScriptBlock { Param ($items) $global:counter += $items.Count; Write-Progress -PercentComplete ($global:Counter / ($Library.ItemCount) * 100) -Activity
“Getting Documents from Library ‘$($Library.Title)'” -Status “Getting Documents data $global:Counter of $($Library.ItemCount)”;} | Where {$_.FileSystemObjectType -eq “File”}
$ItemCounter = 0

#Iterate through each document
Foreach ($Document in $Documents)
{
    #Get the File from Item
    $File = Get-PnPProperty -ClientObject $Document -Property File

    #Get The File Hash
    $Bytes = $File.OpenBinaryStream()
    Invoke-PnPQuery
    $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
    $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

    #Collect data
    $Data = New-Object PSObject
    $Data | Add-Member -MemberType NoteProperty -name "FileName" -value $File.Name
    $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
    $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
    $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -value $File.Length
    $DataCollection += $Data
    $ItemCounter++
    Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" `
    -Status "Reading Data from Document '$($Document['FileLeafRef']) at '$($Document['FileRef'])"
}

}

Get Duplicate Files by Grouping Hash code

$Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1} | Select -ExpandProperty Group
Write-host “Duplicate Files Based on File Hashcode:”
$Duplicates | Format-table -AutoSize

Export the duplicates results to CSV

$Duplicates | Export-Csv -Path $ReportOutput -NoTypeInformation

One Response to “Get duplicate files in SharePoint site using file HASH”