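CSV stands for comma-separated values (sometimes called character-separated values). A CSV file stores tabular data (numbers and text) as plain text, and each line of the file is one data record.

Many data files are saved in CSV format, so CSV read and write speed matters a great deal when handling large volumes of experimental or computed data in Python and Matlab. This test compares how long Python and Matlab take to read and write large CSV files.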
First, generate the data files used for the test. Python:
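import os

import numpy as np
import pandas as pd

N = 10        # number of data files
n = 36        # columns per file
m = int(1e6)  # rows per file
out_dir = "path/to/data/files"  # replace with the directory where the data files are stored

for i in range(N):
    M = np.random.rand(m, n)
    M_df = pd.DataFrame(M)
    # to_csv writes a header row and an index column by default
    M_df.to_csv(os.path.join(out_dir, f"dataset{i}.csv"))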
Matlab:

m = 1e6;
n = 36;
N = 10;
for i = 1:N
    data = rand(m, n);
    writematrix(data, "dataset" + num2str(i) + ".csv")
end
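Files of this shape get large, and it can help to estimate the size before committing disk space. The sketch below serializes a small sample to an in-memory buffer and extrapolates to the full row count. `estimate_csv_bytes` is a hypothetical helper, not part of the original scripts, and the 1000-row sample size is an assumption. The result slightly underestimates the real file, because index values in the full file have more digits than in the sample.

```python
import io

import numpy as np
import pandas as pd


def estimate_csv_bytes(rows, cols, sample_rows=1000, seed=0):
    """Estimate the CSV size in bytes of a rows x cols table of random floats.

    Hypothetical helper: writes a small sample to memory and scales it up.
    """
    rng = np.random.default_rng(seed)
    sample = pd.DataFrame(rng.random((sample_rows, cols)))
    buffer = io.StringIO()
    sample.to_csv(buffer)  # pandas defaults: header row plus index column
    sample_bytes = len(buffer.getvalue().encode("utf-8"))
    # Scale the average bytes per sampled row up to the requested row count.
    return sample_bytes / sample_rows * rows


estimated = estimate_csv_bytes(rows=1_000_000, cols=36)
print(f"estimated size: {estimated / 1e6:.0f} MB")
```

Each value is written with roughly 16 to 17 significant digits, so a 36-column row takes several hundred bytes, which is where the bulk of the file size comes from.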
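Each generated data file is over 600 MB, which should be representative of the data volumes met in most cases.

In Python, the file reads and writes use pandas, a third-party library widely used for data analysis.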
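import os
import time

import numpy as np
import pandas as pd

N = 10
read_dir = "path/to/read/from"  # replace with the directory the data files are read from
write_dir = "path/to/write/to"  # replace with the directory the output files are written to
read_time = []
write_time = []

for i in range(N):
    # Time reading one file
    tic = time.perf_counter()
    M = pd.read_csv(os.path.join(read_dir, f"dataset{i}.csv"))
    read_time.append(time.perf_counter() - tic)
    print(f"read dataset{i}.csv")

    # Time writing it back out
    tic = time.perf_counter()
    M.to_csv(os.path.join(write_dir, f"dataset{i}_out.csv"))
    write_time.append(time.perf_counter() - tic)
    print(f"wrote dataset{i}_out.csv")

print(f"Mean read time: {np.mean(read_time)} s")
print(f"Mean write time: {np.mean(write_time)} s")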
[Python run results]

Matlab:
clc; clear; close all;
my_path = "path\to\write\to\data_out";    % replace with the path the output files are written to
data_path = "path\to\read\from\dataset";  % replace with the path the data files are read from

N = 10;
read_time = zeros(N, 1);
write_time = zeros(N, 1);
for i = 1:N
    % Time reading one file
    tic
    M = readtable(data_path + num2str(i) + ".csv");
    read_time(i) = toc;
    fprintf("read dataset%d.csv\n", i);

    % Time writing it back out
    tic
    % don't use csvwrite, it's limited to 5 significant digits of precision
    writetable(M, my_path + num2str(i) + ".csv");
    write_time(i) = toc;
    fprintf("wrote data_out%d.csv\n", i);
end

fprintf("Mean read time: %f\n", mean(read_time))
fprintf("Mean write time: %f\n", mean(write_time))
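[Matlab run results]

[Comparison of the Python and Matlab results]

These results depend on the computer's configuration and the software versions, so they may differ from machine to machine. If you are interested, try the test on your own computer.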
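For a quick trial before running the full 1e6-row test, the measurement can be scaled down. The sketch below is one possible way to do that, not the benchmark used above: `time_csv_roundtrip` is a hypothetical helper, the 10,000-row table and three repeats are assumptions, files go to a temporary directory, and `index=False` is used so the table read back has the same shape as the one written.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def time_csv_roundtrip(rows, cols, repeats=3):
    """Return (mean write seconds, mean read seconds) for a rows x cols table.

    Hypothetical helper for a scaled-down trial of the benchmark.
    """
    frame = pd.DataFrame(np.random.rand(rows, cols))
    write_times, read_times = [], []
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i in range(repeats):
            path = Path(tmp_dir) / f"dataset{i}.csv"

            # Time writing the table to disk
            tic = time.perf_counter()
            frame.to_csv(path, index=False)
            write_times.append(time.perf_counter() - tic)

            # Time reading it back
            tic = time.perf_counter()
            loaded = pd.read_csv(path)
            read_times.append(time.perf_counter() - tic)

            # Sanity check: the table read back has the shape that was written
            assert loaded.shape == frame.shape
    return float(np.mean(write_times)), float(np.mean(read_times))


write_s, read_s = time_csv_roundtrip(rows=10_000, cols=36)
print(f"mean write time: {write_s:.4f} s")
print(f"mean read time:  {read_s:.4f} s")
```

Timings from a table this small are dominated by fixed overheads, so they are only a smoke test; increase `rows` toward the full size for numbers comparable to the test above.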