我正在尝试将Google Cloud Storage存储桶中的csv文件读取到熊猫数据框中.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)
FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist
我做错了什么,我找不到任何不涉及谷歌datalab的解决方案?
解决方法:
UPDATE
截至0.24版本的pandas,read_csv支持直接从Google云端存储中读取.只需提供链接到这样的桶:
df = pd.read_csv('gs://bucket/your_path.csv')
为了完整起见,我还留下了其他三个选项.
我将在下面介绍它们.
我已经写了一些便利功能来从Google存储中读取.为了使其更具可读性,我添加了类型注释.如果你碰巧在Python 2上,只需删除它们,代码将完全相同.
假设您获得授权,它在公共和私人数据集上同样有效.在此方法中,您无需先将数据下载到本地驱动器.
如何使用它:
fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)
代码:
from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account
def get_byte_fileobj(project: str,
bucket: str,
path: str,
service_account_credentials_path: str = None) -> BytesIO:
"""
Retrieve data from a given blob on Google Storage and pass it as a file object.
:param path: path within the bucket
:param project: name of the project
:param bucket_name: name of the bucket
:param service_account_credentials_path: path to credentials.
TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
:return: file object (BytesIO)
"""
blob = _get_blob(bucket, path, project, service_account_credentials_path)
byte_stream = BytesIO()
blob.download_to_file(byte_stream)
byte_stream.seek(0)
return byte_stream
def get_bytestring(project: str,
bucket: str,
path: str,
service_account_credentials_path: str = None) -> bytes:
"""
Retrieve data from a given blob on Google Storage and pass it as a byte-string.
:param path: path within the bucket
:param project: name of the project
:param bucket_name: name of the bucket
:param service_account_credentials_path: path to credentials.
TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
:return: byte-string (needs to be decoded)
"""
blob = _get_blob(bucket, path, project, service_account_credentials_path)
s = blob.download_as_string()
return s
def _get_blob(bucket_name, path, project, service_account_credentials_path):
credentials = service_account.Credentials.from_service_account_file(
service_account_credentials_path) if service_account_credentials_path else None
storage_client = storage.Client(project=project, credentials=credentials)
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob(path)
return blob
gcsfs
gcsfs是“用于Google云端存储的Pythonic文件系统”.
如何使用它:
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
df = pd.read_csv(f)
Dask“为分析提供高级并行性,为您喜爱的工具提供大规模性能”.当您需要在Python中处理大量数据时,它非常棒. dask尝试模仿大部分的pandas API,使其易于用于新手.
这是read_csv
如何使用它:
import dask.dataframe as dd
df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!
# df is Now dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 [email protected] 举报,一经查实,本站将立刻删除。