home bbs files messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.os.linux.advocacy      Torvalds farts & fans know what he ate      164,974 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 163,971 of 164,974   
   DFS to All   
   Re: Get to know your files and folders!    
   28 Jan 26 00:15:11   
   
   From: nospam@dfs.com   
      
   Didn't notice you cross-posted and set fup to clp.   Here's the same   
   reply I accidentally sent to clp.   
      
      
   On 1/27/2026 4:02 PM, Lawrence D’Oliveiro wrote:   
   > On Mon, 26 Jan 2026 13:28:24 -0500, DFS wrote:   
   >   
   >> Here's some Python code I wrote to capture file metadata (name,   
   >> location, date created, date modified, and size) in a SQLite   
   >> database.   
   >   
   > I would consider this a waste of time. There are already standard *nix   
   > commands (e.g. du(1) ) for   
   > obtaining this information directly from the filesystem, without the   
   > extra steps of collecting the info in a database and having to keep   
   > that up to date.   
      
   Yeah, I know there are various switches on Linux (and Windows) for   
   examining directories and files, but none that would provide what my   
   program does, as far as I know.   
      
   So unless you can write scripts to summarize your files/folders all the   
   ways I do (as shown in the VIEWs), it's not a waste of time.  And even   
   then, unless the data is in a db, it's nearly useless.  All you can do   
   is scroll it up and down on the screen, and you have to write a new   
   program every time you want a different view.   
      
   Plus once the data is in tables, you can query and sort it and   
   manipulate and scroll it in all kinds of ways.   
      
   What kind of script/commands will show the 50 Largest .zip files, in   
   descending order by size, across a large set of subdirectories?   
      
   Once the data is loaded, in about 2 seconds I can summarize 400K files   
   by extension, with one simple query.   
      
   And I haven't done it yet but I might: pop an Access or PyQt interface   
   onto the tables and find/display/compare/open files far far faster than   
   can be done with a typical file manager.   
      
   Did you run it?   
      
      
   >> Tested on Windows and Linux/WSL.   
   >   
   > But not on native Linux?   
      
   I don't have a native Linux install.   
      
      
   > Because WSL forces the Linux kernel to go   
   > through the filesystem-handling bottleneck that is the Windows kernel.   
      
   WSL1 might have, but WSL2 doesn't.  Today the WSL2 distro lives in and   
   runs from an ext4-formatted .vhdx file located at:   
      
   C:\Users\DFS\AppData\Local\Packages\KaliLinux.54290C8133FEE_ey8k   
   hqnwqnmg\LocalState\ext4.vhdx   
      
   I'd say every command-line code I've ever run (mostly C and Python) runs   
   15% to 150% faster on WSL than on Windows.  This python/db-api program   
   runs at least twice as fast on WSL as on Windows.   
      
      
      
   > Just some thoughts:   
   >   
   >      cSQL =  " CREATE TABLE Files "   
   >      cSQL += " ( "   
   >      cSQL += "   FileID       INTEGER NOT NULL PRIMARY KEY, "   
   >      cSQL += "   FolderID     INTEGER REFERENCES Folders (FolderID), "   
   >      cSQL += "   Folder       TEXT    NOT NULL, "   
   >      cSQL += "   FileName     TEXT    NOT NULL, "   
   >      cSQL += "   FileCreated  NUMBER  NOT NULL, "   
   >      cSQL += "   FileModified NUMBER  NOT NULL, "   
   >      cSQL += "   FileSizeKB   NUMBER  NOT NULL "   
   >      cSQL += " );"   
   >   
   > Did you know Python does implicit string concatenation, like C and   
   > C++?   
      
   I've used other ways here and there, but that style of string-building   
   is my habit.   
      
   other ways that might be a little more efficient:   
      
   cSQL =  (   
   " CREATE TABLE Files "   
   " ( "   
   "   FileID       INTEGER NOT NULL PRIMARY KEY, "   
   "   FolderID     INTEGER REFERENCES Folders (FolderID), "   
   "   Folder       TEXT    NOT NULL, "   
   "   FileName     TEXT    NOT NULL, "   
   "   FileCreated  NUMBER  NOT NULL, "   
   "   FileModified NUMBER  NOT NULL, "   
   "   FileSizeKB   NUMBER  NOT NULL "   
   " );"   
   )   
      
      
   cSQL =  " CREATE TABLE Files \   
   ( \   
     FileID       INTEGER NOT NULL PRIMARY KEY, \   
     FolderID     INTEGER REFERENCES Folders (FolderID), \   
     Folder       TEXT    NOT NULL, \   
     FileName     TEXT    NOT NULL, \   
     FileCreated  NUMBER  NOT NULL, \   
     FileModified NUMBER  NOT NULL, \   
     FileSizeKB   NUMBER  NOT NULL \   
   );"   
      
   Definitely don't like this trailing slashes format.   
      
      
   > Also, I notice you are assuming each file has only one parent folder.   
   > You do know *nix systems are not restricted like this, right?   
      
   I didn't know that, nor have I ever seen it in use (that I knew of).   
      
   How do you assign one file to multiple folders, and then see the   
   multiple parent folders associated with the file?   
      
      
      
   >      filesize   = round(os.path.getsize(root + '/' + file)/1000,1)   
   >      filecreate = os.path.getctime(root + '/' + file)   
   >      filecreate = str(datetime.datetime.fromtimestamp(filecreate))[0:19]   
   >      filemod    = os.path.getmtime(root + '/' + file)   
   >   
   > How many different file-info lookups do you need to do on each file?   
      
   1. file size     (os.path.getsize)   
   2. create date   (os.path.getctime)   
   3. last mod date (os.path.getmtime)   
      
      
      
   > How do you handle symlinks? (Yes, even Windows has those now.)   
      
   now?  You mean for 17 years.   
      
   But I personally don't use symlinks on Windows (I do use shortcuts   
   sometimes), and didn't consider them for this exercise.   
      
      
   > The usual way to get this info is with os.lstat()   
   > , which returns it   
   > all with a single OS call.   
      
      
   I compared them:   
      
   original os.path code:   
     filesize   = round(os.path.getsize(root + '/' + file)/1000,1)   
     filecreate = os.path.getctime(root + '/' + file)   
     filecreate = str(datetime.datetime.fromtimestamp(filecreate))[0:19]   
     filemod    = os.path.getmtime(root + '/' + file)   
     filemod    = str(datetime.datetime.fromtimestamp(filemod))[0:19]   
      
   os.lstat():   
     fileStat   = os.lstat(root + '/' + file)   
     filesize   = round(fileStat.st_size/1000,1)   
     filecreate = fileStat.st_ctime   
     filecreate = str(datetime.datetime.fromtimestamp(filecreate))[0:19]   
     filemod    = fileStat.st_mtime   
     filemod    = str(datetime.datetime.fromtimestamp(filemod))[0:19]   
      
      
              27K files  91K files   
   os.path :  2.3 sec    8.0 sec   
   os.lstat:  1.3 sec    4.7 sec   
   os.stat :  1.2 sec    4.5 sec   
      
   Nearly 2x faster.  Cool.  I'll go with the lstat or stat calls from now on.   
      
      
   >> The major slowdown is one cartesian/update query - used to summarize   
   >> data in all subdirectories - for which I haven't been able to figure   
   >> out a decent workaround.   
   >   
   > As I said, your problem is using a DBMS in the first place.   
      
   It's not a problem.  It's the BEST way.   
      
      
   > You are doing a cross-join of *all* files against *all* folders.   
      
   No, it's all folders against all folders, which works fine for smaller   
   counts (with 91K files and 6800 folders the entire program runs in < 10   
   seconds).  But I have a directory containing nearly 57,000   
   subdirectories, and such a cartesian query would probably never finish   
   on my PC.   
      
   v_Temp   
   ------------------------------------   
   SELECT   
     F1.FOLDERID,   
     F1.FOLDER,   
     Count(F2.SUBFOLDERS) as TotalSF,   
     Sum(F2.FILES)        as TotalFiles,   
     Sum(F2.TotalSizeKB)  as TotalSize   
   FROM   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]


(c) 1994,  bbs@darkrealms.ca