Get duplicate files in all SharePoint site using file HASH
Explained everything in the Video : https://youtu.be/WHk2tIav-sQ
The task was to find a way to identify all duplicate files in all SharePoint sites. I searched online for a solution, but none of the scripts I found were accurate. They used different criteria to detect duplicates, such as Name, modified date, size, etc.
After extensive research, I developed the following script that can generate a hash for each file on the SharePoint sites. The hash is a unique identifier that cannot be the same for two files, even if they differ by a single character.
If you want to do the same for only one SharePoint site, you can use below link: Get duplicate files in SharePoint site using file HASH. (itfreesupport.com)
I hope this script will be useful for many people.
Register a new Azure AD Application and Grant Access to the tenant
Register-PnPManagementShellAccess
Then paste and run below pnp script:
Parameters
$TenantURL = “https://tenant-admin.sharepoint.com”
$Pagesize = 2000
$ReportOutput = “C:\Temp\DupSitename.csv”
Connect to SharePoint Online tenant
Connect-PnPOnline $TenantURL -Interactive
Connect-SPOService $TenantURL
Array to store results
$DataCollection = @()
Get all site collections
$SiteCollections = Get-SPOSite -Limit All -Filter “Url -like ‘/sites/‘”
Iterate through each site collection
ForEach ($Site in $SiteCollections)
{
#Get the site URL
$SiteURL = $Site.Url
#Connect to SharePoint Online site
Connect-PnPOnline $SiteURL -Interactive
#Get all Document libraries
$DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin ("Site Pages","Style Library", "Preservation Hold Library")}
#Iterate through each document library
ForEach ($Library in $DocumentLibraries)
{
#Get All documents from the library
$global:counter = 0;
$Documents = Get-PnPListItem -List $Library -PageSize $Pagesize -Fields ID, File_x0020_Type -ScriptBlock `
{ Param ($items) $global:counter += $items.Count; Write-Progress -PercentComplete ($global:Counter / ($Library.ItemCount) * 100) -Activity `
"Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:Counter of $($Library.ItemCount)";} | Where {$_.FileSystemObjectType -eq "File"}
$ItemCounter = 0
#Iterate through each document
Foreach ($Document in $Documents)
{
#Get the File from Item
$File = Get-PnPProperty -ClientObject $Document -Property File
#Get The File Hash
$Bytes = $File.OpenBinaryStream()
Invoke-PnPQuery
$MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))
#Collect data
$Data = New-Object PSObject
$Data | Add-Member -MemberType NoteProperty -name "FileName" -value $File.Name
$Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
$Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
$Data | Add-Member -MemberType NoteProperty -Name "FileSize" -value $File.Length
$DataCollection += $Data
$ItemCounter++
Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" `
-Status "Reading Data from Document '$($Document['FileLeafRef']) at '$($Document['FileRef'])"
}
}
}
Get Duplicate Files by Grouping Hash code
$Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1} | Select -ExpandProperty Group
Write-host “Duplicate Files Based on File Hashcode:”
$Duplicates | Format-table -AutoSize
Export the duplicates results to CSV
$Duplicates | Export-Csv -Path $ReportOutput -NoTypeInformation